Preface


After decades of development, automation and intelligence have increased significantly in
the process industry, and key technologies continue to make breakthroughs. In the era
of the “New Industrial Revolution”, it is of great significance to use modern information
technology to promote intelligent manufacturing with the goals of safety, efficiency,
and environmental sustainability. Safety has always been the lifeline of intelligent and
optimized manufacturing in process industries.

With the increasing requirements for production safety and quality improvement, 
process monitoring and fault diagnosis have gained great attention in academic 
research and even in industrial applications. The widespread use of sensor networks 
and distributed control systems has facilitated access to a wealth of process
data. How to effectively use the data generated during the production process 
and the process mechanism knowledge for process monitoring and fault diagnosis 
is a topic worth exploring for the large and complex process industrial systems. 
Fruitful academic results have been produced recently and widely used in the actual 
production process. 

The authors of this book have devoted themselves to the theoretical and applied 
research work on data-driven industrial process monitoring and fault diagnosis 
methods for many years. They are deeply concerned with the flourishing development 
of data-driven fault diagnosis techniques. This book focuses on both multivariate 
statistical process monitoring (MSPM) and Bayesian inference diagnosis. It intro- 
duces the basic multivariate statistical modeling methods, as well as the authors’ latest 
achievements around the practical industrial needs, including multi-transition process 
monitoring, fault classification and identification, quality-related fault detection, and 
fault root tracing. 

The main contributions given in this book are as follows: 

(1) Soft-transition-based high-precision monitoring for multi-stage batch 
processes: Most batch processes obviously have several operation stages with 
different process characteristics. In addition, their data present obvious three- 
dimensional features with strong nonlinearity and time variability. So it is difficult to 
apply multivariate statistical methods directly to the monitoring of batch processes. 
This book proposes a soft-transition-based fault detection method. First, a two-step 
stage division method based on Support Vector Data Description (SVDD) is given,
then a dynamic soft-transition model of transition stages is constructed; finally, the 
monitoring in the original measurement space is given for each stage. Compared 
with the traditional method, the advantages of the proposed method are reflected 
in the following techniques: improvement of soft-transition process design, statistic 
decomposition, and fusion indicator monitoring. Together, these greatly increase the
accuracy of batch process fault detection.

(2) Fault classification and identification for batch process with variable produc- 
tion cycle: Batch processes inevitably are subject to the changes in initial condi- 
tions and the external environment, which can cause changes in production cycles. 
However, current monitoring methods for batch processes generally require equal
production cycles and a complete production trajectory. Therefore, the variable cycle
and the estimation of unknown values in the complete trajectory become the bottleneck
for improving the diagnostic performance. This book gives a fault diagnosis method for
batch processes based on kernel Fisher envelope analysis. It builds envelope surface
models for the normal condition and for each known fault condition. An online
fault diagnosis strategy is then proposed based on these surface models. Further, the
fusion of kernel Fisher envelope analysis and PCA is proposed for fault diagnosis
of batch processes. It effectively solves the fault classification and identification
problem for batch production processes of unequal length.

(3) Quality-related fault detection with fusion of global and local features: The key
of manufacturing is to guarantee the final product quality, yet it is difficult or extremely
costly to acquire quality information in real time. Therefore, it is of great practical
value to monitor the process variables that have an impact on the final quality output
in order to further enable quality-related fault detection and diagnosis. This book proposes an
idea of quality-related projection with the fusion of global and local features to obtain 
the correlation between quality variables and process variables. It is well known that 
the partial least squares projection algorithm looks for global structural change infor- 
mation based on the process covariance maximization direction. The local preserva- 
tion projection, or manifold learning approach can exactly maintain the local neigh- 
borhood structure and achieve nonlinear mapping by using linear approximation. 
The proposed fusion approach constructs potential geometric structures containing 
both global and local information, extracts meaningful low-dimensional structural 
information to represent the relationship between high-dimensional process variables 
and quality data. Thus, it effectively achieves the detection of quality-related faults 
for strongly nonlinear and strongly dynamic processes. 

(4) Bayesian fault diagnosis and root tracing combined with process mechanism: 
Due to the complex interrelationships among system components, the same fault source
may manifest differently in different process variables. The traditional
contribution plot in multivariate statistical monitoring is inefficient in fault
root tracing. This book proposes an uncertainty knowledge expression inference
model, named probabilistic causal graph model, based on probability theory and 
graph theory. It intuitively and accurately reveals the qualitative and quantitative 
relationships between process variables. Then a framework for fault diagnosis and 
root tracing based on the proposed model is given. Different modeling and inference
techniques are given for discrete and continuous systems, respectively. So, the
inference can perform real-time dynamic analysis of discrete alarm states or contin- 
uous process variables. The forward inference predicts the univariate and multivariate 
alarms or fault events, while the reverse implements the accurate fault root tracing 
and localization. 

The book consists of 14 chapters divided into four parts: 

Part I, Chaps. 1-4, is devoted to mathematical background. Chapter 1 gives the
basic knowledge about process monitoring measures, common detection indices,
and their control limits. Chapters 2-3 focus on the basic multivariate statistical methods,
including principal component analysis (PCA), partial least squares (PLS), canonical
correlation analysis (CCA), canonical variable analysis (CVA), and Fisher discriminant
analysis (FDA). To help readers learn the above theoretical methods, Chap. 4
gives a detailed introduction to the Tennessee Eastman (TE) continuous chemical
simulation platform and the penicillin semi-batch reaction simulation platform.
Readers can collect appropriate process data and conduct corresponding simulation
experiments on these simulation platforms.

Part II, Chaps. 5-8, is organized around the main contributions 1 and 2 of this
book. Various improved fault detection and identification methods are given for
batch processes. Chapters 5-6 address contribution 1, aiming at high-precision
monitoring of multi-stage processes, based on the Support Vector Data
Description (SVDD) soft-transition process and a fusion index design based on statistics
decomposition. Chapters 7-8 address contribution 2, aiming at fault identification
for complex batch processes with unequal cycles, based on kernel Fisher
envelope surface analysis and local linear embedded Fisher discriminant analysis,
respectively.

Part III, Chaps. 9-12, is organized around the main contribution 3 of this book.
To improve the statistical model between process variables and quality variables
with nonlinear correlation, two different strategies are considered. First, under the
idea of global and local feature fusion, the manifold structure is considered to
extract the nonlinear correlations between them effectively. A unified framework of
spatial optimization projection is constructed based on the effective fusion of two
types of performance indices, global covariance maximization and local adjacency
structure minimization. A variety of different performance combinations are given
in Chaps. 9-11: QGLPLS, LPPLS, and LLEPLS, respectively. Another strategy is to
consider the nonlinearity as uncertainty; a robust L1-PLS is then proposed in Chap. 12.
It enhances the robustness of the PLS method based on latent structure regression
with L1. The effectiveness and applicability of the above combination methods are
discussed.

Part IV, Chaps. 13-14, is organized around the main contribution 4 of the book.
The known industrial process flow structure is integrated with industrial data
analytics, and the qualitative causal relationships among process variables are
established by multivariate causal analysis methods. The quantitative causal dependencies
among process variables are characterized by conditional probability density
estimation under this network structure. Thus, a Bayesian causal probability graph model
of complex systems is realized for process variable failure prediction and reverse
tracing. The specific implementations of Bayesian inference in
discrete alarm variable analysis and continuous process variable analysis are given
in this book.

Fault detection and diagnosis (FDD) is one of the core topics in modern complex 
industrial processes. It attracts the attention of scientists and engineers from various 
fields such as control, mechanics, mathematics, engineering, and automation. This 
book gives an in-depth study of various data-driven analysis methods and their appli- 
cations in process monitoring, especially for data modeling, fault detection, fault 
classification, fault identification, and fault reasoning. Oriented toward industrial
big data analytics and industrial artificial intelligence, this book integrates multivariate
statistical analysis, Bayesian inference, machine learning, and other intelligent anal- 
ysis methods. This book attempts to establish a basic framework of complex industrial 
process monitoring suitable for various types of industrial data processing, and gives 
a variety of fault detection and diagnosis theories, methods, algorithms, and various 
applications. It provides data-driven fault diagnosis techniques of interest to advanced
undergraduate and graduate students and to researchers in automation and
industrial safety. It also provides various applications of engineering modeling, data
analysis, and processing methods for related practitioners and engineers.
Chapter 1
Background


1.1 Introduction 


Fault detection and diagnosis (FDD) technology is a scientific field that emerged in the
middle of the twentieth century with the rapid development of science and data
technology. It manifests itself as the accurate sensing of abnormalities in the
manufacturing process, or the health monitoring of equipment, sites, or machinery in a
specific operating site. FDD includes abnormality monitoring, abnormal cause
identification, and root cause location. Through qualitative and quantitative analysis of
field process and historical data, operators and managers can detect alarms that affect
product quality or cause major industrial accidents. This helps to cut off failure
paths and repair abnormalities in a timely manner.


1.1.1 Process Monitoring Method 


In general, the FDD technique is divided into several parts: fault detection, fault isolation,
fault identification, and fault diagnosis (Hwang et al. 2010; Zhou and Hu 2009). Fault
detection determines whether a fault has appeared. Once a fault (or error) has been
successfully detected, damage assessment needs to be performed, i.e., fault isolation
(Yang et al. 2006). Fault isolation lies in determining the type, location, magnitude,
and time of the fault (i.e., the observed out-of-threshold variables). It should be noted
that fault isolation here does not mean isolating specific components of a system with the
purpose of stopping errors from propagating. In this sense, fault identification may be
a better term, since it also covers determining how the fault changes over time. Isolation
and identification are commonly used in the FDD process without strict distinction.
In this book, fault diagnosis determines the cause of the observed out-of-threshold
variables, so it is also called fault root tracing. During the process of fault tracing,
efforts are made to locate the source of the fault and find the root cause.
Fig. 1.1 Classification of fault diagnosis methods


FDD involves control theory, probability statistics, signal processing, machine 
learning, and many other research areas. Many effective methods have been devel- 
oped, and they are usually classified into three categories, knowledge-based, analyt- 
ical, and data-driven (Chiang et al. 2001). Figure 1.1 shows the classification of fault 
diagnosis methods. 


(1) Analytical Method 

The analytical model of the engineering system is obtained based on the mathematical
and physical mechanism. The analytical model-based method monitors
the process in real time according to mathematical models that are often constructed from
first principles and physical characteristics. Most analytical measures involve state
estimation (Wang et al. 2020), parameter estimation (Yu 1997), parity space (Ding
2013), and analytical redundancy (Suzuki et al. 1999). The analytical method appears
to be relatively simple and is usually applied to systems with a relatively small number
of inputs, outputs, and states. It is impractical for modern complex systems, since it
is not easy to establish an accurate mathematical model given their complex
characteristics, such as nonlinearity, strong coupling, uncertainty, and ultra-high-dimensional
input and output.

(2) Knowledge-Based Method 

Knowledge-based fault diagnosis does not require an accurate mathematical model. 
Its basic idea is to use expert knowledge or qualitative relationship to develop the fault 
detection rules. The common approaches mainly include fault tree diagnosis (Hang 
et al. 2006), expert system diagnosis (Gath and Kulkarn 2014), directed graphs, fuzzy 
logic (Miranda and Felipe 2015), etc. The application of knowledge-based models 
strongly relies on complete empirical knowledge of the process. Once the information of
the diagnosed object is known from expert experience and historical data, a variety of
rules for appropriate reasoning are constructed. However, the accumulation of process
experience and knowledge is time-consuming and even difficult. Therefore, this
method is not universal and can only be applied to engineering systems with which people
are familiar.

(3) Data-Driven Method 

The data-driven method is based on the rise of modern information technology. In fact,
it involves a variety of disciplines and techniques, including statistics, mathematical
analysis, and signal processing. Generally speaking, the industrial data in the field are
collected and stored by intelligent sensors. Data analysis can mine the hidden
information contained in the data, establish the data model between input and output, help
the operator to monitor the system status in real time, and achieve the purpose of fault
diagnosis. Data-driven fault diagnosis methods can be divided into three categories:
signal processing-based, statistical analysis-based, and artificial intelligence-based
(Zhou et al. 2011; Bersimis et al. 2007). The commonality of these methods is
that high-dimensional variables are projected into a low-dimensional space while
extracting the key features of the system. The data-driven method does not require an
accurate model, so it is more universal.

Both analytical techniques and data-driven methods have their own merits, but also
have certain limitations. Therefore, a fusion-driven approach combining mechanistic
knowledge and data could compensate for the shortcomings of a single technique.
This book explores the fault detection, fault isolation/identification, and fault root
tracing problems mainly based on multivariate statistical analysis as a mathematical
foundation.


1.1.2 Statistical Process Monitoring 


Fault detection and diagnosis based on multivariate statistical analysis has developed
rapidly and a large number of results have emerged recently. This class of methods,
based on historical data, uses multivariate projection to decompose the sample
space into a low-dimensional principal component subspace and a residual subspace.
Then the corresponding statistics are constructed to monitor the observation variables.
Thus, this method is also called the latent variable projection method.


(1) Fault Detection 

The common multivariate statistical fault detection methods include principal component
analysis (PCA), partial least squares (PLS), canonical correlation analysis
(CCA), canonical variable analysis (CVA), and their extensions. Among them, PCA
and PLS, as the most basic techniques, are usually used for monitoring processes
with Gaussian distributions. These methods usually use Hotelling’s $T^2$ and squared
prediction error (SPE) statistics to detect variation of process information.




It is worth noting that these techniques extract the process features by max- 
imizing the variance or covariance of process variables. They only utilize the 
information of first-order statistics (mathematical expectation) and second-order 
statistics (variance and covariance) while ignoring the higher order statistics (higher 
order moments and higher order cumulants). Actually, there are few processes in 
practice that are subject to the Gaussian distribution. The traditional PCA and PLS 
are unable to extract effective features from non-Gaussian processes due to omitting 
the higher order statistics. It reduces the monitoring efficiency. 

Numerous practical production conditions, such as strong nonlinearity, strong 
dynamics, and non-Gaussian distribution, make it difficult to directly apply the basic 
multivariate monitoring methods. To solve these practical problems, various extended 
multivariate statistical monitoring methods have flourished. For example, to deal 
with the process dynamics, dynamic PCA and dynamic PLS methods have been 
developed, which take into account the autocorrelation and cross-correlation among 
variables (Li and Gang 2006). To deal with the non-Gaussian distribution, indepen- 
dent component analysis (ICA) methods have also been developed (Yoo et al. 2004). 
To deal with the process nonlinearity, some extended kernel methods such as kernel 
PCA (KPCA), kernel PLS (KPLS), and kernel ICA (KICA) have emerged (Cheng 
et al. 2011; Zhang and Chi 2011; Zhang 2009). 
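
To illustrate how dynamic extensions take autocorrelation and cross-correlation into account, the following minimal sketch builds a time-lagged data matrix, on which an ordinary PCA or PLS model could then be fitted. The function name `lagged_matrix` and the parameter `lags` are illustrative assumptions and are not notation from the cited works.

```python
import numpy as np

def lagged_matrix(X, lags):
    """Stack each sample with its `lags` previous samples.

    X has shape (n, m); the result has shape (n - lags, m * (lags + 1)),
    with row t equal to [x_t, x_{t-1}, ..., x_{t-lags}].
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 0 <= lags < n:
        raise ValueError("lags must satisfy 0 <= lags < number of samples")
    # Block j holds the samples delayed by j steps.
    blocks = [X[lags - j : n - j] for j in range(lags + 1)]
    return np.hstack(blocks)

X = np.arange(10.0).reshape(5, 2)   # 5 samples of 2 variables
Xd = lagged_matrix(X, lags=2)
print(Xd.shape)   # (3, 6)
print(Xd[0])      # [4. 5. 2. 3. 0. 1.]
```

Each row of the augmented matrix contains the current sample and its recent history, so the covariance of this matrix carries the serial correlation that a static model ignores.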

(2) Fault Isolation or Identification 

A common approach for separating faults is the contribution plot. It is an unsupervised
approach that uses only the process data to find fault variables and does not require
other prior knowledge. Successful separation based on the contribution plot relies on
the following properties: (1) each variable has the same mean value of contribution
under the normal operation and (2) the faulty variables have very large contribution
values under the fault conditions, compared with other normal variables. Alcala
and Qin summarized the commonly used contribution plot techniques, such as complete
decomposition contributions (CDC), partial decomposition contributions (PDC), and
reconstruction-based contributions (RBC) (Alcala and Qin 2009, 2011).
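
As a rough illustration of a contribution plot, the sketch below computes per-variable contributions to the SPE in the spirit of a complete decomposition: because the residual projector is symmetric and idempotent, the contribution of variable $i$ reduces to the squared $i$-th residual entry, and the contributions sum to the SPE. The data, the number of retained components, and all names (`spe_contributions`, `C_tilde`, `x_fault`) are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normal-operation data: 200 samples of 4 correlated variables.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 4))
mean = X.mean(axis=0)
Xc = X - mean

# PCA loadings from the sample covariance; keep k = 2 components.
S = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
P = eigvecs[:, order[:2]]

# Residual projector C_tilde = I - P P^T.
C_tilde = np.eye(4) - P @ P.T

def spe_contributions(x):
    """Per-variable contributions to SPE: the squared residual entries."""
    residual = C_tilde @ (x - mean)
    return residual ** 2

# A test sample with a bias added to the third variable (index 2).
x_fault = X[0].copy()
x_fault[2] += 3.0
contrib = spe_contributions(x_fault)
spe = float(np.sum((C_tilde @ (x_fault - mean)) ** 2))
print(contrib, spe)
```

Plotting `contrib` as a bar chart gives the usual contribution plot; the largest bar is only a candidate for the faulty variable, not a guaranteed identification.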

However, the contribution plot usually suffers from the smearing effect, a situation in
which non-faulty variables show larger contribution values, while the contribution
values of the fault variables are smaller. Westerhuis et al. pointed out that one variable
may affect other variables during the execution of PCA, thus creating a smearing
effect (Westerhuis et al. 2000). Kerkhof et al. analyzed the smearing effect in three
types of contribution indices, CDC, PDC, and RBC, respectively (Kerkhof et al.
2013). They pointed out that, from the perspective of mathematical decomposition, the
smearing effect is caused by the compression and expansion operations of variables. So it
cannot be avoided during the transformation of data from measurement space to latent
variable space. In order to eliminate the smearing effect, several new contribution
indices have been given based on dynamically calculating the average value of the current
and previous residuals (Wang et al. 2017).

If the historical data collected have been previously categorized into separate
classes where each class pertains to a particular fault, fault isolation or identification
can be transformed into a pattern classification problem. Statistical methods, such
as Fisher’s discriminant analysis (FDA) (Chiang et al. 2000), have also been successfully
applied in industrial practice to solve this problem. FDA assigns the data into two or
more classes via three steps: feature extraction, discriminant analysis, and maximum
selection. If the historical data have not been previously categorized, unsupervised
cluster analysis, such as the K-Means algorithm, may classify the data into separate
classes accordingly (Jain et al. 2000). More recently, neural network and machine learning
techniques imported from statistical analysis theory have been receiving increasing
attention, such as support vector data description (SVDD) covered in this book.

(3) Fault Diagnosis or Root Tracing 

Fault root tracing based on Bayesian network (BN) is a typical diagnostic method
that combines the mechanism knowledge and process data. BN, also known as
probabilistic network or causal network, is a typical probabilistic graphical model. Since
the end of last century, it has gradually become a research hotspot due to its superior
theoretical properties in describing and reasoning about uncertain knowledge. BN
was first proposed by Pearl, a professor at the University of California, in 1988, to
solve the problem of uncertain information in artificial intelligence. BN represents
the relationships among the causal variables in the form of directed acyclic graphs.
In the fault diagnosis process of an industrial system, each observed variable is used as
a node containing all the information about the equipment, control quantities, and faults
in the system. The causal connection between variables is quantitatively described
as a directed edge with the conditional probability distribution function (Cai et al.
2017). The fault diagnosis procedure with BNs consists of BN structure modeling, BN
parameter modeling, BN forward inference, and BN inverse tracing.
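
The difference between forward inference and inverse tracing can be seen on a toy network. The sketch below assumes a hypothetical structure with two binary root causes `R1` and `R2` pointing to a single binary alarm `A`; every probability value is made up for illustration, and inference is done by brute-force enumeration rather than by any particular BN library.

```python
from itertools import product

# Hypothetical two-cause network: R1 -> A <- R2 (all variables binary).
# All probabilities below are made up for illustration.
prior = {"R1": 0.02, "R2": 0.05}            # P(root cause is active)
p_alarm = {                                  # P(A = 1 | R1, R2)
    (0, 0): 0.01,
    (1, 0): 0.90,
    (0, 1): 0.60,
    (1, 1): 0.95,
}

def joint(r1, r2, a):
    """Joint probability P(R1 = r1, R2 = r2, A = a) from the BN factorization."""
    p_r1 = prior["R1"] if r1 else 1 - prior["R1"]
    p_r2 = prior["R2"] if r2 else 1 - prior["R2"]
    p_a = p_alarm[(r1, r2)] if a else 1 - p_alarm[(r1, r2)]
    return p_r1 * p_r2 * p_a

# Forward inference: marginal probability that the alarm fires.
p_a1 = sum(joint(r1, r2, 1) for r1, r2 in product((0, 1), repeat=2))

# Inverse tracing: posterior of each root cause given an observed alarm.
post_r1 = sum(joint(1, r2, 1) for r2 in (0, 1)) / p_a1
post_r2 = sum(joint(r1, 1, 1) for r1 in (0, 1)) / p_a1
print(p_a1, post_r1, post_r2)
```

Ranking the root causes by their posterior probability is the simplest form of root tracing; enumeration scales exponentially with the number of nodes, so it is only suitable for very small networks.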

In addition to probabilistic graphical models such as BN, other causal graphical
models have also developed vigorously. These advances aim at
determining the causal relationship among the operating units of the system based
on hypothesis testing (Zhang and Hyvärinen 2008; Shimizu et al. 2006). A generative
model (linear or nonlinear) is built to explain the data generation process, i.e.,
causality. Then the direction of causality is tested under certain assumptions.
The most typical one is the linear non-Gaussian acyclic model (LiNGAM) and its
improved version (Shimizu et al. 2006, 2011). It has the advantage of determining
the causal structure of variables without pre-specifying their causal order. All these
results serve as a driving force for the development of probabilistic graphical
models and play an increasingly important role in the field of fault diagnosis.


1.2 Fault Detection Index 


The effectiveness of data-driven measures often depends on the characterization of
process data changes. Generally, there are two types of changes in process data:
common and special. Common changes are entirely caused by random noise, while
special changes refer to all data changes that are not caused by common causes, such as
impulse disturbances. Common process control strategies may be able to remove
most of the data changes with special causes, but these strategies cannot remove
the common cause changes inherent in the process data. As process data changes
are inevitable, statistical theory plays an important role in most process monitoring
programs.

By defining faults as abnormal process conditions, it is easy to know that the 
application of statistical theory in the monitoring process actually relies on a reason- 
able assumption: unless the system fails, the data change characteristics are almost 
unchanged. This means that the characteristics of data fluctuations, such as mean and 
variance, are repeatable for the same operating conditions, although the actual value 
of the data may not be very predictable. The repeatability of statistical attributes 
allows automatic determination of thresholds for certain measures, effectively defin- 
ing out-of-control conditions. This is an important step to automate the process 
monitoring program. Statistical process monitoring (SPM) relies on the use of normal
process data to build a process model. Here, we discuss the main points of SPM,
i.e., the fault detection indices.

In multivariate process monitoring, the variability in the residual subspace (RS)
is represented typically by the squared sum of the residual, namely the Q statistic or the
squared prediction error (SPE). The variability in the principal component subspace
(PCS) is represented typically by Hotelling’s $T^2$ statistic. Owing to the complementary
nature of the two indices, combined indices have also been proposed for fault detection
and diagnosis. Another statistic that measures the variability in the RS is Hawkins’
statistic (Hawkins 1974). The global Mahalanobis distance can also be used as a
combined measure of variability in the PCS and RS. Individual tests of PCs can also
be conducted (Hawkins 1974), but they are often not preferred in practice, since one
has to monitor many statistics. In this section, we summarize several fault detection
indices and provide a unified representation.


1.2.1 $T^2$ Statistic


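Consider the sampled data with $m$ observation variables $x = [x_1, x_2, \ldots, x_m]$ and $n$
observations for each variable. The data are stacked into a matrix $X \in R^{n \times m}$, given
by

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{bmatrix}. \tag{1.1}
$$

First, the matrix $X$ is scaled to zero mean, and the sample covariance matrix is equal
to

$$
S = \frac{1}{n-1} X^{\mathrm{T}} X. \tag{1.2}
$$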


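An eigenvalue decomposition of the matrix $S$,

$$
S = \bar{P} \bar{\Lambda} \bar{P}^{\mathrm{T}} = [P \ \ \tilde{P}] \, \mathrm{diag}\{\Lambda, \tilde{\Lambda}\} \, [P \ \ \tilde{P}]^{\mathrm{T}}, \tag{1.3}
$$

reveals the correlation structure of the covariance matrix $S$, where $\bar{P}$ is orthogonal
($\bar{P} \bar{P}^{\mathrm{T}} = I$, in which $I$ is the identity matrix) (Qin 2003) and

$$
\begin{aligned}
\Lambda &= \frac{1}{n-1} T^{\mathrm{T}} T = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_k\}, \\
\tilde{\Lambda} &= \frac{1}{n-1} \tilde{T}^{\mathrm{T}} \tilde{T} = \mathrm{diag}\{\lambda_{k+1}, \lambda_{k+2}, \ldots, \lambda_m\}, \\
\lambda_1 &\ge \lambda_2 \ge \cdots \ge \lambda_m, \qquad \sum_{i=1}^{k} \lambda_i > \sum_{j=k+1}^{m} \lambda_j, \\
\lambda_i &= \frac{1}{n-1} t_i^{\mathrm{T}} t_i \approx \mathrm{var}(t_i),
\end{aligned}
$$

when $n$ is very large. The score vector $t_i$ is the $i$-th column of $\bar{T} = [T, \tilde{T}]$. The PCS is
$S_p = \mathrm{span}\{P\}$ and the RS is $S_r = \mathrm{span}\{\tilde{P}\}$. Therefore, the matrix $X$ is decomposed
into a score matrix $\bar{T}$ and a loading matrix $\bar{P} = [P, \tilde{P}]$, that is

$$
X = \bar{T} \bar{P}^{\mathrm{T}} = \hat{X} + \tilde{X} = T P^{\mathrm{T}} + \tilde{T} \tilde{P}^{\mathrm{T}} = X P P^{\mathrm{T}} + X (I - P P^{\mathrm{T}}). \tag{1.4}
$$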
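
The decomposition of the data matrix into its PCS and RS parts can be checked numerically. The following is a minimal sketch on synthetic data; the data generator, the choice of two retained components, and the variable names (`P_bar`, `P_tilde`, `X_hat`, `X_res`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 500 samples of m = 5 variables driven by 2 latent factors.
n, m, k = 500, 5, 2
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, m)) + 0.05 * rng.normal(size=(n, m))
X = X - X.mean(axis=0)                      # scale to zero mean

S = X.T @ X / (n - 1)                       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder so lambda_1 >= ... >= lambda_m
lam, P_bar = eigvals[order], eigvecs[:, order]

P, P_tilde = P_bar[:, :k], P_bar[:, k:]     # loadings of the PCS and the RS
T, T_tilde = X @ P, X @ P_tilde             # score matrices

X_hat = T @ P.T                             # part of X explained by the PCS
X_res = T_tilde @ P_tilde.T                 # residual part in the RS

print(np.allclose(X_hat + X_res, X))
print(np.allclose(T.T @ T / (n - 1), np.diag(lam[:k])))
```

`numpy.linalg.eigh` is used because the covariance matrix is symmetric; it returns the eigenvalues in ascending order, so they are reordered explicitly to match the convention of descending eigenvalues.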


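The sample vector $x$ can be projected on the PCS and RS, respectively:

$$
x = \hat{x} + \tilde{x}, \tag{1.5}
$$

$$
\hat{x} = P P^{\mathrm{T}} x, \tag{1.6}
$$

$$
\tilde{x} = \tilde{P} \tilde{P}^{\mathrm{T}} x = (I - P P^{\mathrm{T}}) x. \tag{1.7}
$$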


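Assuming $S$ is invertible and with the definition

$$
z = \Lambda^{-1/2} P^{\mathrm{T}} x, \tag{1.8}
$$

Hotelling’s $T^2$ statistic is given by (Chiang et al. 2001)

$$
T^2 = z^{\mathrm{T}} z = x^{\mathrm{T}} P \Lambda^{-1} P^{\mathrm{T}} x. \tag{1.9}
$$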


The observation vector $x$ is projected into a set of uncorrelated variables $y$ by
$y = P^{\mathrm{T}} x$. The rotation matrix $P$, obtained directly from the covariance matrix of $x$,
guarantees that $y$ corresponds to $x$. $\Lambda$ scales the elements of $y$ to produce a set of
variables with unit variance corresponding to the elements of $z$. The conversion of the
covariance matrix is demonstrated graphically in Fig. 1.2 for a two-dimensional observation
space ($m = 2$) (Chiang et al. 2001).
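
The rotation and scaling steps can be sketched in a few lines. The example below assumes that all $m$ components are retained, so that the statistic coincides with the quadratic form in the inverse sample covariance; the mixing matrix `A`, the sample `x_new`, and the function name `t2_statistic` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical zero-mean training data with n = 300 samples and m = 3 variables.
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
X = rng.normal(size=(300, 3)) @ A
X = X - X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, P = eigvals[order], eigvecs[:, order]   # all m components are kept here

def t2_statistic(x):
    """T^2 = z^T z with z = Lambda^{-1/2} P^T x."""
    y = P.T @ x                 # rotate to uncorrelated coordinates
    z = y / np.sqrt(lam)        # scale each coordinate to unit variance
    return float(z @ z)

x_new = np.array([1.0, -0.5, 0.2])
t2 = t2_statistic(x_new)
print(t2, float(x_new @ np.linalg.inv(S) @ x_new))
```

The two printed values agree up to rounding, and the scaled scores of the training data have an identity sample covariance, which is the conversion described above.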

Fig. 1.2 A graphical illustration of the covariance conversion for the $T^2$ statistic

The $T^2$ statistic is a scaled squared 2-norm of an observation vector $x$ from its
mean. An appropriate scalar threshold is used to monitor the variability of the data in
the entire $m$-dimensional observation space. It is determined based on an appropriate
probability distribution with a given significance level $\alpha$. In general, it is assumed that


• the observations are randomly sampled and subject to a multivariate normal
  distribution;

• the mean vector and covariance matrix of observations sampled in the normal
  operations are equal to the actual ones, respectively.


Then the $T^2$ statistic follows a $\chi^2$ distribution with $m$ degrees of freedom (Chiang
et al. 2001),

$$
T_\alpha^2 = \chi_\alpha^2(m). \tag{1.10}
$$

The set $T^2 \le T_\alpha^2$ is an elliptical confidence region in the observation space, as
illustrated in Fig. 1.3 for two process variables. The threshold (1.10) is applied to
monitor unusual changes. An observation vector projected within the confidence
region indicates that the process data are in control, whereas a projection outside it
indicates that a fault has occurred (Chiang et al. 2001).

When the actual covariance matrix for the normal status is not known but instead
estimated from the sample covariance matrix (1.2), the threshold for fault detection
is given by

$$
T_\alpha^2 = \frac{m(n-1)(n+1)}{n(n-m)} F_\alpha(m, n-m), \tag{1.11}
$$

where $F_\alpha(m, n-m)$ is the upper $100\alpha\%$ critical point of the $F$-distribution with
$m$ and $n-m$ degrees of freedom (Chiang et al. 2001). For the same significance
level $\alpha$, the upper in-control limit in (1.11) is larger (more conservative) than that in
(1.10). The two limits approach each other as the number of observations increases
($n \to \infty$) (Tracy et al. 1992).
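
The convergence of the two limits can be made explicit with a short argument. The following is only a sketch under the normality assumptions stated above; the symbol $\nu$ is introduced here for illustration.

```latex
% Sketch: why the F-based limit approaches the chi-square limit as n grows.
\begin{align*}
\frac{m(n-1)(n+1)}{n(n-m)}\, F_\alpha(m, n-m)
  &= \frac{(n-1)(n+1)}{n(n-m)} \cdot \bigl[\, m\, F_\alpha(m, \nu) \,\bigr],
  \qquad \nu = n - m, \\
\frac{(n-1)(n+1)}{n(n-m)} &\longrightarrow 1 \quad (n \to \infty), \\
m\, F_\alpha(m, \nu) &\longrightarrow \chi_\alpha^2(m) \quad (\nu \to \infty).
\end{align*}
% Hence the F-based limit tends to \chi_\alpha^2(m), the chi-square limit.
```

The last line uses the standard fact that $m$ times an $F(m, \nu)$ variable converges in distribution to a $\chi^2$ variable with $m$ degrees of freedom as $\nu$ grows.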


1.2.2 Squared Prediction Error 


The SPE index measures the projection of the sample vector on the residual subspace:

$$
\mathrm{SPE} := \|\tilde{x}\|^2 = \|(I - P P^{\mathrm{T}}) x\|^2. \tag{1.12}
$$

The process is considered as normal if

$$
\mathrm{SPE} \le \delta_\alpha^2, \tag{1.13}
$$


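where $\delta_\alpha^2$ denotes the upper control limit of SPE with a significance level of $\alpha$. Jackson
and Mudholkar gave an expression for $\delta_\alpha^2$ (Jackson and Mudholkar 1979):

$$
\delta_\alpha^2 = \theta_1 \left( \frac{z_\alpha \sqrt{2 \theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right)^{1/h_0}, \tag{1.14}
$$

where

$$
\theta_i = \sum_{j=k+1}^{m} \lambda_j^i, \quad i = 1, 2, 3, \tag{1.15}
$$

$$
h_0 = 1 - \frac{2 \theta_1 \theta_3}{3 \theta_2^2}, \tag{1.16}
$$

$k$ is the number of retained principal components, and $z_\alpha$ is the normal deviate
corresponding to the upper $1 - \alpha$ percentile. Note that the above result is obtained
under the following conditions.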


• The sample vector $x$ follows a multivariate normal distribution.

• In deriving the control limits, an approximation is made to this distribution that is
  valid when $\theta_1$ is very large.

• This result holds regardless of the number of principal components retained in the
  model.


When a fault occurs, the fault sample vector x consists of the normal part
superimposed on the faulty part. The fault causes the SPE to exceed the threshold
$\delta_\alpha^2$, which results in the fault being detected.
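
The limit (1.14)-(1.16) is straightforward to evaluate numerically. The following is a minimal sketch, assuming the residual eigenvalues are already available; the function name `spe_limit_jm` and the eigenvalues used in the demo are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist

import numpy as np


def spe_limit_jm(residual_eigvals, alpha=0.01):
    """SPE control limit following (1.14)-(1.16) for the discarded eigenvalues."""
    lam = np.asarray(residual_eigvals, dtype=float)
    theta1, theta2, theta3 = (np.sum(lam ** i) for i in (1, 2, 3))
    h0 = 1.0 - 2.0 * theta1 * theta3 / (3.0 * theta2 ** 2)
    z = NormalDist().inv_cdf(1.0 - alpha)  # upper (1 - alpha) normal quantile
    term = (z * np.sqrt(2.0 * theta2 * h0 ** 2) / theta1
            + 1.0
            + theta2 * h0 * (h0 - 1.0) / theta1 ** 2)
    return theta1 * term ** (1.0 / h0)


# Hypothetical residual eigenvalues lambda_{k+1}, ..., lambda_m (illustration only).
residual = [0.30, 0.20, 0.10]
limit_99 = spe_limit_jm(residual, alpha=0.01)
limit_95 = spe_limit_jm(residual, alpha=0.05)
```

A smaller significance level gives a larger limit, so `limit_99` exceeds `limit_95`, and both exceed the sum of the residual eigenvalues, which is the expected value of SPE under normal operation.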

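Nomikos and MacGregor (1995) used the results in Box (1954) to derive an
alternative upper control limit for SPE:

$$\delta_\alpha^2 = g \chi^2_{h;\alpha}, \tag{1.17}$$

where

$$g = \theta_2/\theta_1, \quad h = \theta_1^2/\theta_2. \tag{1.18}$$

The relationship between the SPE thresholds (1.14) and (1.17) is as follows (Nomikos
and MacGregor 1995):

$$\delta_\alpha^2 \cong gh \left( 1 - \frac{2}{9h} + z_\alpha \sqrt{\frac{2}{9h}} \right)^3.$$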


1.2.3 Mahalanobis Distance 
Define the following Mahalanobis distance, which forms the global Hotelling's $T^2$
test:

$$D = x^{\mathrm{T}} S^{-1} x \sim \frac{m(n^2 - 1)}{n(n - m)} F_{m,\,n-m}, \tag{1.19}$$
where S is the sample covariance of X. When S is singular with rank(S) = r < m,
Mardia discusses the use of the pseudo-inverse of S, which in turn yields the
Mahalanobis distance of the reduced-rank covariance matrix (Brereton 2015):

$$D_r = x^{\mathrm{T}} S^{+} x \sim \frac{r(n^2 - 1)}{n(n - r)} F_{r,\,n-r}, \tag{1.20}$$

where $S^{+}$ is the Moore-Penrose pseudo-inverse. It is straightforward to show that
the global Mahalanobis distance is the sum of $T^2$ in PCS and
$T_H^2 = x^{\mathrm{T}} \tilde{P} \tilde{\Lambda}^{-1} \tilde{P}^{\mathrm{T}} x$ (Hawkins' statistic, Hawkins 1974) in RS:

$$D = T^2 + T_H^2. \tag{1.21}$$

When the number of observations n is quite large, the global Mahalanobis distance
approximately obeys the $\chi^2$ distribution with m degrees of freedom:

$$D \sim \chi^2_m. \tag{1.22}$$

Similarly, the reduced-rank Mahalanobis distance follows:

$$D_r \sim \chi^2_r. \tag{1.23}$$

Therefore, faults can be detected using the correspondingly defined control limits
for $D$ and $D_r$.
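
The decomposition of the global Mahalanobis distance into a PCS part and an RS part can be checked numerically. The sketch below uses synthetic data and an arbitrarily assumed number of retained components `k = 2`; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training data: n = 200 observations of m = 4 correlated variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

S = X.T @ X / (X.shape[0] - 1)                 # sample covariance
eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                          # retained components (assumed)
P, lam = eigvecs[:, :k], eigvals[:k]           # PCS loadings and eigenvalues
P_res, lam_res = eigvecs[:, k:], eigvals[k:]   # RS loadings and eigenvalues

x = X[0]                                       # one sample vector
D = x @ np.linalg.solve(S, x)                  # global Mahalanobis distance
T2 = np.sum((P.T @ x) ** 2 / lam)              # Hotelling's T^2 in the PCS
TH2 = np.sum((P_res.T @ x) ** 2 / lam_res)     # Hawkins' statistic in the RS
```

Up to rounding, `D` equals `T2 + TH2` for any sample and any choice of `k`, because the eigenvectors of S form an orthonormal basis.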


1.2.4 Combined Indices 


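In practice, better monitoring performance can be achieved in some cases by using a
single combined index instead of two separate indices to monitor the process. Yue and
Qin (2001) proposed a combined index for fault detection that combines SPE and $T^2$
as follows:

$$\varphi = \frac{\mathrm{SPE}(x)}{\delta_\alpha^2} + \frac{T^2(x)}{\chi^2_{l;\alpha}} = x^{\mathrm{T}} \Phi x, \tag{1.24}$$

where

$$\Phi = \frac{P \Lambda^{-1} P^{\mathrm{T}}}{\chi^2_{l;\alpha}} + \frac{\tilde{P}\tilde{P}^{\mathrm{T}}}{\delta_\alpha^2} = \frac{P \Lambda^{-1} P^{\mathrm{T}}}{\chi^2_{l;\alpha}} + \frac{I - PP^{\mathrm{T}}}{\delta_\alpha^2}. \tag{1.25}$$

Notice that $\Phi$ is symmetric and positive definite. To use this index for fault
detection, the upper control limit of $\varphi$ is derived from the results of Box (1954), which
provides an approximate distribution with the same first two moments as the exact
distribution. Using the approximate distribution given in Box (1954), the statistic
$\varphi$ is approximated as follows:

$$\varphi = x^{\mathrm{T}} \Phi x \sim g \chi^2_h, \tag{1.26}$$

where the coefficient

$$g = \frac{\mathrm{tr}\left[(S\Phi)^2\right]}{\mathrm{tr}(S\Phi)} \tag{1.27}$$

and the degree of freedom for the $\chi^2_h$ distribution is

$$h = \frac{[\mathrm{tr}(S\Phi)]^2}{\mathrm{tr}\left[(S\Phi)^2\right]}, \tag{1.28}$$

in which

$$\mathrm{tr}(S\Phi) = \frac{l}{\chi^2_{l;\alpha}} + \frac{\sum_{i=l+1}^{m} \lambda_i}{\delta_\alpha^2}, \tag{1.29}$$

$$\mathrm{tr}\left[(S\Phi)^2\right] = \frac{l}{\chi^4_{l;\alpha}} + \frac{\sum_{i=l+1}^{m} \lambda_i^2}{\delta_\alpha^4}. \tag{1.30}$$

After computing g and h, for a given significance level $\alpha$, an upper control limit
for $\varphi$ can be obtained. A fault is detected by $\varphi$ if

$$\varphi > g \chi^2_{h;\alpha}. \tag{1.31}$$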
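
The quadratic form (1.24)-(1.25) and the trace expressions (1.29)-(1.30) can be verified on synthetic data. In this sketch the number of retained components is assumed to be 2, so the chi-square limit has the closed form $-2\ln\alpha$; the SPE limit `delta2` is a placeholder value, not a computed one.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # hypothetical data
X = X - X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_pc = 2                                   # retained components l (assumed)
P, lam = eigvecs[:, :n_pc], eigvals[:n_pc]
chi2_lim = -2.0 * np.log(0.01)             # chi-square(2 dof) upper 1% point
delta2 = 3.0                               # placeholder SPE limit (assumed)

Phi = P @ np.diag(1.0 / lam) @ P.T / chi2_lim + (np.eye(5) - P @ P.T) / delta2

x = X[0]
spe = np.sum((x - P @ (P.T @ x)) ** 2)
t2 = np.sum((P.T @ x) ** 2 / lam)
phi_sum = spe / delta2 + t2 / chi2_lim     # left form of (1.24)
phi_quad = x @ Phi @ x                     # right form of (1.24)

tr1 = np.trace(S @ Phi)
tr2 = np.trace(S @ Phi @ S @ Phi)
g, h = tr2 / tr1, tr1 ** 2 / tr2           # (1.27) and (1.28)
tr1_closed = n_pc / chi2_lim + eigvals[n_pc:].sum() / delta2
tr2_closed = n_pc / chi2_lim ** 2 + (eigvals[n_pc:] ** 2).sum() / delta2 ** 2
```

The control limit in (1.31) then follows from the upper quantile of a chi-square distribution with `h` degrees of freedom, scaled by `g`; that quantile needs a statistics library and is left out of this sketch.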


It is worth noting that Raich and Cinar suggest another combined statistic (Raich
and Cinar 1996),

$$c\,\frac{\mathrm{SPE}(x)}{\delta_\alpha^2} + (1 - c)\,\frac{T^2(x)}{\chi^2_{l;\alpha}}, \tag{1.32}$$

where $c \in (0, 1)$ is a constant. They further give a rule that a value of this statistic
less than 1 is considered normal. However, this may lead to wrong results because even if
the above statistic is less than 1, it is possible that $\mathrm{SPE}(x) > \delta_\alpha^2$ or
$T^2(x) > \chi^2_{l;\alpha}$ (Qin 2003).
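
As a quick numerical illustration of this caveat (the values are made up for illustration), take $c = 0.5$ with an SPE that exceeds its own limit by 60% and a small $T^2$:

```latex
c = 0.5,\quad \frac{\mathrm{SPE}(x)}{\delta_\alpha^2} = 1.6,\quad
\frac{T^2(x)}{\chi^2_{l;\alpha}} = 0.2
\;\Longrightarrow\;
0.5 \times 1.6 + 0.5 \times 0.2 = 0.9 < 1 .
```

The weighted statistic reports normal operation although SPE alone would flag a fault.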


1.2.5 Control Limits in Non-Gaussian Distribution 


Nonlinear characteristics are a hotspot of current process monitoring research.
Many nonlinear methods, such as kernel principal component analysis, neural networks, and
manifold learning, are widely used for component extraction in process monitoring.
The principal components extracted by such methods may not follow a
Gaussian distribution. In that case, the control limits of the $T^2$ and Q (SPE) statistics
are calculated from their probability density functions (PDFs), which can be estimated by
the nonparametric kernel density estimation (KDE) method. KDE applies to the
$T^2$ and Q statistics because they are univariate, although the processes represented
by these statistics are multivariate. Therefore, the control limits for the monitoring
statistics ($T^2$ and SPE) are calculated from their respective PDF estimates, given by

$$\int_{-\infty}^{Th_{T^2,\alpha}} g(T^2)\, dT^2 = \alpha, \qquad \int_{-\infty}^{Th_{\mathrm{SPE},\alpha}} g(\mathrm{SPE})\, d\mathrm{SPE} = \alpha, \tag{1.33}$$

where

$$g(z) = \frac{1}{nh} \sum_{j=1}^{n} K\!\left( \frac{z - z_j}{h} \right),$$

K denotes a kernel function and h denotes the bandwidth or smoothing parameter.
Finally, the fault detection logic for the PCS and RS is as follows:

$$\begin{cases} T^2 > Th_{T^2,\alpha} \ \text{or} \ T_{\mathrm{SPE}} > Th_{\mathrm{SPE},\alpha}, & \text{faulty}, \\ T^2 \le Th_{T^2,\alpha} \ \text{and} \ T_{\mathrm{SPE}} \le Th_{\mathrm{SPE},\alpha}, & \text{fault-free}. \end{cases}$$
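
A KDE-based control limit can be sketched as follows. This is only an illustration under stated assumptions: a Gaussian kernel, a Silverman-style rule-of-thumb bandwidth, and rectangle-rule integration on a grid are choices made here, not prescribed by the text, and the function name `kde_control_limit` is hypothetical.

```python
import numpy as np


def kde_control_limit(stat, alpha=0.99, grid_size=2000):
    """Threshold Th such that the KDE-estimated PDF integrates to alpha up to Th."""
    z = np.asarray(stat, dtype=float)
    n = z.size
    h = 1.06 * z.std(ddof=1) * n ** (-1 / 5)        # rule-of-thumb bandwidth (assumed)
    grid = np.linspace(z.min() - 4 * h, z.max() + 4 * h, grid_size)
    # g(grid) = 1/(n h) * sum_j K((grid - z_j) / h) with a standard normal kernel K
    u = (grid[:, None] - z[None, :]) / h
    pdf = np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    cdf = np.cumsum(pdf) * (grid[1] - grid[0])      # rectangle-rule integration
    idx = min(int(np.searchsorted(cdf, alpha)), grid_size - 1)
    return grid[idx]


rng = np.random.default_rng(2)
spe_train = rng.chisquare(df=3, size=500)           # hypothetical non-Gaussian statistic
limit = kde_control_limit(spe_train, alpha=0.99)
frac_below = np.mean(spe_train <= limit)
```

With this setup roughly 99% of the training values fall below `limit`; a new sample whose statistic exceeds it would be flagged as faulty.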
Chapter 2
Multivariate Statistics in Single Observation Space


The observation data collected from continuous industrial processes usually fall into
two main categories, process data and quality data, and the corresponding industrial
data analysis applies multivariate statistical techniques to these two types of data.
Process data are collected by the distributed control system (DCS) in real
time with frequent sampling (the basic sampling period is usually 1 s). For example,
there are five typical variables in the process industries: temperature, pressure, flow
rate, liquid level, and composition. Among them, temperature, pressure, flow rate,
and liquid level are process variables. However, it is generally difficult to acquire
real-time quality measurements because of the limitations of quality sensors. Usually, the
quality data are obtained by taking samples for laboratory tests, and their sampling
frequency is much lower than that of process data. For example, product composition,
viscosity, molecular weight distribution, and other quality-related parameters
need to be obtained through various analytical instruments in the laboratory, such as
composition analyzers, gel permeation chromatography (GPC), or mass spectrometry.

Process data and quality data belong to two different observation spaces, so the
corresponding statistical analysis methods are divided into two categories:
single observation space and multiple observation spaces. This book introduces
the basic multivariate statistical techniques from this observation-space perspective.
This chapter focuses on the analysis methods in a single observation space,
including PCA and FDA. The core of these methods lies in a spatial projection
oriented to different needs, such as sample dispersion or multi-class sample
separation. This projection extracts the necessary and effective features while
achieving dimensional reduction. The next chapter focuses on the multivariate
statistical analysis methods between two observation spaces, specifically
PLS, CCA, and CVA. These methods aim at maximizing the correlation of variables
in different observation spaces and thereby achieve feature extraction and dimensional
reduction.


2.1 Principal Component Analysis 


As the modern industrial production system is becoming larger and more complex, the 
stored historical data not only has high dimensionality but also has strong coupling 
and correlation between the process variables. This also makes it impractical to 
monitor so many process variables at the same time. Therefore, we need to find 
a reasonable method to minimize the loss of information contained in the original 
variables while reducing the dimension of monitoring variables. If a small number 
of independent variables can be used to accurately reflect the operating status of 
the system, the operators can monitor these few variables to achieve the purpose of 
controlling the entire production process. 

Principal component analysis (PCA) is one of the most widely used multivariate
statistical algorithms (Pan et al. 2008). It is mainly used to monitor process
data with high dimensionality and strong linear correlation. It decomposes the
high-dimensional process variables into a few independent principal components and
then establishes a model. The extracted features constitute the principal
component subspace (PCS) of the PCA algorithm, and this space contains most of
the changes in the system. The remaining features constitute the residual subspace (RS),
which mainly contains the noise and interference during the monitoring process
and a small amount of system change information (Wiesel and Hero 2009). Because it
integrates the variables, PCA can overcome the overlapping information caused by
multiple correlations and simultaneously achieve dimensional reduction
of high-dimensional data. It also highlights the main features and
removes the noise and some unimportant features in the PCS.


2.1.1 Mathematical Principle of PCA 


Suppose the data matrix $X \in \mathbb{R}^{n \times m}$, where m is the number of variables and n is the
number of observations for each variable. Matrix X can be decomposed into the sum
of outer products of k vectors (Wang et al. 2016; Gao 2013):

$$X = t_1 p_1^{\mathrm{T}} + t_2 p_2^{\mathrm{T}} + \cdots + t_k p_k^{\mathrm{T}}, \tag{2.1}$$

where $t_i$ is the score vector, also called the principal component of the matrix X, and
$p_i$ is the feature vector corresponding to the principal component, also called the load
vector. Then (2.1) can also be written in matrix form:

$$X = T P^{\mathrm{T}}. \tag{2.2}$$

Here $T = [t_1, t_2, \ldots, t_k]$ is called the score matrix and
$P = [p_1, p_2, \ldots, p_k]$ is called the load matrix. The score vectors are orthogonal to each
other,

$$t_i^{\mathrm{T}} t_j = 0, \quad i \ne j. \tag{2.3}$$

The following relationships exist between load vectors:

$$\begin{cases} p_i^{\mathrm{T}} p_j = 0, & i \ne j, \\ p_i^{\mathrm{T}} p_j = 1, & i = j. \end{cases} \tag{2.4}$$

This shows that the load vectors are also orthogonal to each other and that the length
of each load vector is 1.
Multiplying both sides of (2.2) on the right by the load vector $p_i$ and combining with
(2.4), we get

$$t_i = X p_i. \tag{2.5}$$


Equation (2.5) shows that each score vector $t_i$ is the projection of the original data
X in the direction of the load vector $p_i$ corresponding to $t_i$. The length of the score
vector $t_i$ reflects the coverage degree of the original data X in the direction of $p_i$.
The longer $t_i$ is, the greater the coverage degree or range of change of the
data matrix X in the direction of $p_i$ (Han 2012). The score vectors are arranged as
follows:

$$\|t_1\| > \|t_2\| > \|t_3\| > \cdots > \|t_k\|. \tag{2.6}$$

The load vector $p_1$ represents the direction in which the data matrix X changes
most, and the load vector $p_2$ is orthogonal to $p_1$ and represents the direction of the
second largest change of X. Similarly, the load vector $p_k$ represents the direction
in which X changes least. When most of the variance is contained in the first r
load vectors and the variance contained in the latter m − r load vectors is almost zero,
the latter can be omitted. The data matrix X is then decomposed into the following
form:

$$X = t_1 p_1^{\mathrm{T}} + t_2 p_2^{\mathrm{T}} + \cdots + t_r p_r^{\mathrm{T}} + E = \hat{X} + E = T P^{\mathrm{T}} + E, \tag{2.7}$$

where $\hat{X}$ is the principal component matrix and E is the residual matrix, whose main
information is caused by measurement noise. PCA divides the original data space
into the principal component subspace (PCS) and the residual subspace (RS). These two
subspaces are orthogonal and complementary to each other. The principal component
subspace mainly reflects the changes caused by normal data, while the residual
subspace mainly reflects the changes caused by noise and interference.

PCA calculates the optimal loading vectors p by solving the optimization
problem

$$J = \max_{p \ne 0} \frac{p^{\mathrm{T}} X^{\mathrm{T}} X p}{p^{\mathrm{T}} p}. \tag{2.8}$$

The number r of principal components is generally obtained from the cumulative percent
variance (CPV). Apply eigenvalue decomposition or singular value decomposition to
the covariance matrix of X to obtain all the eigenvalues $\lambda_i$. CPV is defined as
follows:

$$\mathrm{CPV} = \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \times 100\%. \tag{2.9}$$

Generally, r is chosen as the smallest number for which the CPV value is greater than
or equal to 85%.
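
The CPV rule can be written in a few lines. The helper name `select_by_cpv` and the eigenvalues in the demo are hypothetical; the 85% threshold is the one quoted above.

```python
import numpy as np


def select_by_cpv(eigvals, threshold=0.85):
    """Smallest r whose cumulative percent variance reaches the threshold."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cpv = np.cumsum(lam) / lam.sum()
    # first position where cpv >= threshold, converted to a component count
    r = int(np.searchsorted(cpv, threshold) + 1)
    return r, cpv


# Hypothetical eigenvalues of a covariance matrix (illustration only).
r, cpv = select_by_cpv([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])
```

For these eigenvalues the cumulative fractions are 0.40, 0.65, 0.80, 0.90, 0.96, 1.00, so four components are retained.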


2.1.2 PCA Component Extraction Algorithm 


There are two algorithms to implement PCA component extraction. Algorithm 1 
is based on the singular value decomposition (SVD) of the covariance matrix and 
Algorithm 2 obtains each principal component based on Nonlinear Iterative Partial 
Least Squares algorithm (NIPALS), developed by H. Wold at first for PCA and later 
for PLS (Wold 1992). It gives more numerically accurate results compared with the 
SVD of the covariance matrix, but is slower to calculate. 

The PCA dimensional reduction is illustrated by simple two-dimensional random 
data. Figure 2.1 shows the original random data sample in two-dimensional space. 
Figure 2.2 is a visualization with principal axis and confidence ellipse of the original 
data. The green ray gives the direction with the largest variance of the original data 
and the black ray shows the direction of second largest variance. 

PCA projects the original data X from the two-dimensional space into a
one-dimensional subspace along the direction of maximum variance. The
dimensional reduction is shown in Fig. 2.3.


Fig. 2.1 Two-dimensional raw random data
Algorithm 1 SVD-based component extraction algorithm
Input:
Data matrix X.
Output:
r principal components.
[S1] Normalize the original data set $X = [x^{\mathrm{T}}(1), x^{\mathrm{T}}(2), \ldots, x^{\mathrm{T}}(n)]^{\mathrm{T}} \in \mathbb{R}^{n \times m}$, in which
$x = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{1 \times m}$, to zero mean and unit variance.
[S2] Calculate the covariance matrix S of the normalized data matrix X:

$$S = \frac{1}{n-1} X^{\mathrm{T}} X. \tag{2.10}$$

[S3] Find the eigenvalues and eigenvectors of the covariance matrix S using eigenvalue
decomposition:

$$|\lambda_i I - S| = 0, \qquad (\lambda_i I - S) p_i = 0. \tag{2.11}$$

[S4] Sort the eigenvalues from large to small and determine the first r eigenvalues based on the
CPV index. Construct the corresponding eigenvector matrix $P = [p_1, p_2, \ldots, p_r]$ according to
the eigenvalues $D = (\lambda_1, \ldots, \lambda_r)$.
[S5] Calculate the score matrix T based on the following relationship:

$$T = XP. \tag{2.12}$$

[S6] The normalized data matrix X is decomposed as follows:

$$X = \hat{X} + \tilde{X} = T P^{\mathrm{T}} + \tilde{X}, \tag{2.13}$$

where $\hat{X}$ is the principal component part of the data and $\tilde{X}$ is the residual part.
return r principal components
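
The steps of Algorithm 1 map directly onto a few NumPy calls. The following is a minimal sketch on synthetic data, not a reference implementation; the function name `pca_eig` and the data generation are assumptions made for illustration, and the eigenvalue decomposition of S is used as in step S3.

```python
import numpy as np


def pca_eig(X_raw, cpv_threshold=0.85):
    """Covariance-based PCA following steps S1-S6."""
    X = np.asarray(X_raw, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # [S1] normalize
    S = X.T @ X / (X.shape[0] - 1)                      # [S2] covariance (2.10)
    eigvals, eigvecs = np.linalg.eigh(S)                # [S3] eigen-decomposition
    order = np.argsort(eigvals)[::-1]                   # [S4] sort, pick r by CPV
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cpv = np.cumsum(eigvals) / eigvals.sum()
    r = int(np.searchsorted(cpv, cpv_threshold) + 1)
    P = eigvecs[:, :r]
    T = X @ P                                           # [S5] scores (2.12)
    X_hat = T @ P.T                                     # [S6] principal part (2.13)
    return X, P, T, X_hat, X - X_hat


rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 2))                      # hypothetical latent factors
X_raw = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(500, 6))
X, P, T, X_hat, X_res = pca_eig(X_raw)
```

The loading matrix `P` has orthonormal columns, and the scores `T` are orthogonal to the residual part `X_res`, which reflects the orthogonality of PCS and RS.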
Fig. 2.2 Visualization of the change principal axis and confidence ellipse of the
original data
Algorithm 2 NIPALS-based component extraction algorithm
Input:
Data matrix X.
Output:
r principal components.
[S1] Normalize the original data X.
[S2] Set i = 1, choose a column $x_j$ from X, and mark it as $t_{1,i}$, that is, $t_{1,i} = x_j$.
[S3] Calculate the load vector $p_1$:

$$p_1 = \frac{X^{\mathrm{T}} t_{1,i}}{t_{1,i}^{\mathrm{T}} t_{1,i}}. \tag{2.14}$$

[S4] Normalize $p_1$:

$$p_1 = \frac{p_1}{\|p_1\|}. \tag{2.15}$$

[S5] Calculate the score vector $t_{1,i+1}$:

$$t_{1,i+1} = \frac{X p_1}{p_1^{\mathrm{T}} p_1}. \tag{2.16}$$

[S6] Compare $t_{1,i}$ and $t_{1,i+1}$. If $\|t_{1,i+1} - t_{1,i}\| < \varepsilon$, where $\varepsilon > 0$ is a very small
positive constant, go to S7. If $\|t_{1,i+1} - t_{1,i}\| \ge \varepsilon$, set i = i + 1 and go back to S3.

[S7] Calculate the residual $E_1 = X - t_1 p_1^{\mathrm{T}}$, replace X with $E_1$, and return to S2 to calculate the
next principal component $t_2$, until the CPV value meets the requirements.

[S8] r principal components are obtained, namely:

$$X = t_1 p_1^{\mathrm{T}} + t_2 p_2^{\mathrm{T}} + \cdots + t_r p_r^{\mathrm{T}} + \tilde{X} = T P^{\mathrm{T}} + \tilde{X}. \tag{2.17}$$

return r principal components
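
A compact sketch of the NIPALS iteration is given below. Two simplifications are assumptions of this sketch rather than part of Algorithm 2: the starting column is the one with the largest variance, and the number of components is fixed in advance instead of being chosen by CPV. The demo data are centered but not scaled.

```python
import numpy as np


def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Extract principal components one at a time by NIPALS with deflation."""
    E = np.array(X, dtype=float)
    scores, loadings = [], []
    for _ in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()    # [S2] start from a column
        for _it in range(max_iter):
            p = E.T @ t / (t @ t)                    # [S3] load vector (2.14)
            p /= np.linalg.norm(p)                   # [S4] normalize (2.15)
            t_new = E @ p                            # [S5] score (2.16), p'p = 1
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:                            # [S6] convergence check
                break
        E = E - np.outer(t, p)                       # [S7] deflate the data
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.column_stack(loadings), E


rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])   # hypothetical data
X = X - X.mean(axis=0)
T, P, E = nipals_pca(X, n_components=2)
v1 = np.linalg.eigh(X.T @ X)[1][:, -1]               # leading eigenvector for comparison
```

Up to sign, the first NIPALS loading `P[:, 0]` coincides with the leading eigenvector `v1` of $X^{\mathrm{T}}X$, which is what the eigenvalue route of Algorithm 1 would return.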


2.1.3 PCA-Based Fault Detection


PCA can be applied to all kinds of data analysis problems, such as exploration
and visualization of high-dimensional data sets, data compression, data preprocessing,
dimensional reduction, removing data redundancy, and denoising. When it is
applied to the field of FDD, the detection process is divided into offline modeling
and online monitoring.


(1) Offline modeling: use the training data to construct a principal component
analysis model and calculate the monitored statistics, such as SPE and $T^2$, and their
control limits;

(2) Online monitoring: when a new sample vector x is obtained, it can be decomposed
into projections on PCS and RS (Zhang et al. 2017),

$$\hat{x} = PP^{\mathrm{T}} x, \qquad \tilde{x} = (I - PP^{\mathrm{T}}) x, \tag{2.18}$$

where $\hat{x}$ is the projection of the sample x in PCS and $\tilde{x}$ is the projection of the
sample in RS. Calculate the statistics SPE (1.12) on RS and $T^2$ (1.9) on PCS
for the new sample x. Compare the statistics of the new sample with the
control limits obtained from the training data. If a statistic of the new sample
exceeds its control limit, a fault has occurred; otherwise the system
is in normal operation.


Fig. 2.3 Dimensional reduction results


$\hat{x}$ and $\tilde{x}$ are not only orthogonal ($\hat{x}^{\mathrm{T}} \tilde{x} = 0$) but also statistically independent
($E(\hat{x} \tilde{x}^{\mathrm{T}}) = 0$). This gives PCA a natural advantage in process
monitoring. The flowchart of PCA-based fault detection is shown in Fig. 2.4. In
general, the fault detection process based on other multivariate statistical analysis
methods is similar to that of PCA; only the statistical model and the statistical indices differ.
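
The offline and online stages can be sketched end to end as follows. Everything here is illustrative: the data are synthetic, the number of retained components is fixed at 2, and the control limits are taken as empirical 99% quantiles of the training statistics, which is a simple stand-in for the limits derived in Chap. 1.

```python
import numpy as np

rng = np.random.default_rng(5)
# Offline modeling on hypothetical normal-operation data (2 factors, 6 variables).
A = rng.normal(size=(2, 6))
train = rng.normal(size=(1000, 2)) @ A + 0.1 * rng.normal(size=(1000, 6))
mean, std = train.mean(axis=0), train.std(axis=0, ddof=1)
Xn = (train - mean) / std

S = Xn.T @ Xn / (Xn.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
r = 2                                               # retained components (assumed)
P, lam = eigvecs[:, order[:r]], eigvals[order[:r]]


def statistics(x):
    """Return (T2, SPE) of one normalized sample x."""
    t = P.T @ x
    x_res = x - P @ t                               # x_tilde = (I - P P^T) x
    return np.sum(t ** 2 / lam), x_res @ x_res


stats_train = np.array([statistics(x) for x in Xn])
t2_limit, spe_limit = np.quantile(stats_train, 0.99, axis=0)   # empirical limits

# Online monitoring: a new sample with a large bias on variable 0 (simulated fault).
x_new = rng.normal(size=2) @ A + 0.1 * rng.normal(size=6)
x_new[0] += 10.0
t2_new, spe_new = statistics((x_new - mean) / std)
fault = bool((t2_new > t2_limit) or (spe_new > spe_limit))
```

The biased sample breaks the correlation structure learned from the training data, so at least one of the two statistics exceeds its limit and `fault` is `True`.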


2.2 Fisher Discriminant Analysis 


Industrial processes are heavily instrumented and large amounts of data are collected 
online and stored in computer database. A lot of data are usually collected during out- 
of-control operations. When the data collected during an out-of-control operation has 
been previously diagnosed, the data can be classified into separate categories, where 
each category is related to a specific fault. When the data have not been diagnosed
before, cluster analysis can help diagnose the operations during which the data were
collected, and the data can be grouped into new categories accordingly. If hyperplanes
can separate the data into classes, as shown in Fig. 2.5, these separating planes define
the boundaries of each fault area. Once a fault is detected using the online data
observation, the fault can be diagnosed by determining the fault area in which the
observation is located. Assuming that the detected fault is represented in the database,
the fault can be correctly diagnosed in this way.


Fig. 2.4 PCA-based fault detection


2.2.1 Principle of FDA 


Fisher discriminant analysis (FDA), a dimensionality reduction technique that has
been extensively studied in the pattern classification domain, takes into account the
information between the classes. For fault diagnosis, data collected from the plant
during specific fault operations are categorized into classes, where each class
contains data representing a particular fault. FDA is a classical linear dimensionality
reduction technique that is optimal in terms of maximizing the separation between these
classes. The main idea of FDA is to project data from a high-dimensional space into a
lower dimensional space such that the projection maximizes
the scatter between classes while minimizing the scatter within each class. This means
that high-dimensional data of the same class are projected to the low-dimensional
space and clustered together, while different classes are far apart.

Fig. 2.5 Two-dimensional comparison of FDA and PCA


Consider training data for all classes $X \in \mathbb{R}^{n \times m}$, where n and m are the numbers of
observations and measurement variables, respectively. In order to understand FDA,
it is first necessary to define various matrices, including the total scatter matrix,
intra-class (within-class) scatter matrix, and inter-class (between-class) scatter matrix. The
total scatter matrix is

$$S_t = \sum_{i=1}^{n} (x(i) - \bar{x})(x(i) - \bar{x})^{\mathrm{T}}, \tag{2.19}$$

where $x(i)$ represents the vector of measurement variables for the i-th observation
and $\bar{x}$ is the total mean vector,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x(i). \tag{2.20}$$

The within-scatter matrix for class j is

$$S_j = \sum_{x(i) \in \mathcal{X}_j} (x(i) - \bar{x}_j)(x(i) - \bar{x}_j)^{\mathrm{T}}, \tag{2.21}$$

where $\mathcal{X}_j$ is the set of vectors $x(i)$ which belong to class j and $\bar{x}_j$ is the mean
vector for class j:

$$\bar{x}_j = \frac{1}{n_j} \sum_{x(i) \in \mathcal{X}_j} x(i), \tag{2.22}$$

where $n_j$ is the number of observations in the j-th class. The intra-class scatter
matrix is

$$S_w = \sum_{j=1}^{p} S_j, \tag{2.23}$$

where p is the number of classes. The inter-class scatter matrix is

$$S_b = \sum_{j=1}^{p} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^{\mathrm{T}}. \tag{2.24}$$

It is obvious that the following relationship always holds:

$$S_t = S_b + S_w. \tag{2.25}$$

Maximizing the inter-class scatter means that the sample centers of different classes
are as far apart as possible after projection ($\max v^{\mathrm{T}} S_b v$). Minimizing the intra-class
scatter is equivalent to making the sample points of the same class cluster together
as much as possible after projection ($\min v^{\mathrm{T}} S_w v$, $|S_w| \ne 0$), where $v \in \mathbb{R}^m$.

The optimal FDA projection vector w is obtained by

$$J = \max_{w \ne 0} \frac{w^{\mathrm{T}} S_b w}{w^{\mathrm{T}} S_w w}. \tag{2.26}$$

Both the numerator and the denominator contain the projection vector w. Considering that w
and $aw$, $a \ne 0$, have the same effect, let $w^{\mathrm{T}} S_w w = 1$; then the optimization objective
(2.26) becomes

$$J = \max_{w} w^{\mathrm{T}} S_b w \quad \text{s.t.} \quad w^{\mathrm{T}} S_w w = 1. \tag{2.27}$$

First, consider the optimization of the first FDA vector $w_1$. Solve (2.27) by the
Lagrange multiplier method:

$$L(w_1, \lambda_1) = w_1^{\mathrm{T}} S_b w_1 - \lambda_1 (w_1^{\mathrm{T}} S_w w_1 - 1).$$

Setting the partial derivative of L with respect to $w_1$ to zero gives

$$\frac{\partial L}{\partial w_1} = 2 S_b w_1 - 2 \lambda_1 S_w w_1 = 0.$$

The first FDA vector is therefore an eigenvector $w_1$ of the generalized eigenvalue
problem

$$S_b w_1 = \lambda_1 S_w w_1 \;\Rightarrow\; S_w^{-1} S_b w_1 = \lambda_1 w_1. \tag{2.28}$$
Finding the first FDA vector thus boils down to finding the eigenvector $w_1$ corresponding to
the largest eigenvalue of the matrix $S_w^{-1} S_b$.

The second FDA vector is computed such that the inter-class scatter is maximized
while the intra-class scatter is minimized on all axes perpendicular to the first FDA
vector, and the same is true for the remaining FDA vectors. The kth FDA vector is
obtained by

$$S_w^{-1} S_b w_k = \lambda_k w_k,$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{p-1}$ and $\lambda_k$ indicates the degree of overall separability among
the classes obtained by projecting the data onto $w_k$.

When $S_w$ is invertible, the FDA vectors can be computed from the generalized
eigenvalue problem. This is almost always true as long as the number of observations
n is significantly larger than the number of measurements m (the usual case in practice). If
the $S_w$ matrix is not invertible, PCA can be used to project the data into $m_1$ dimensions
before executing FDA, where $m_1$ is the number of non-zero eigenvalues of the
covariance matrix $S_t$.

The first FDA vector is the eigenvector associated with the largest eigenvalue, the
second FDA vector is the eigenvector associated with the second largest eigenvalue,
and so on. A large eigenvalue $\lambda_k$ shows that when the data in the classes are projected
onto the associated eigenvector $w_k$, there is a large overall separation of the class means
relative to the within-class variance, and thus a large degree of separation among
classes along the direction of $w_k$. Since the rank of $S_b$ is less than p, at most
p − 1 eigenvalues are not equal to zero, and FDA provides a useful ordering of the
eigenvectors only in these directions.
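
A minimal sketch of the computation is shown below. Instead of forming $S_w^{-1} S_b$ explicitly, it whitens with the Cholesky factor of $S_w$ so that a symmetric eigenvalue solver can be used; this is an implementation choice of the sketch, and the function name `fda_vectors` and the three-class data are hypothetical.

```python
import numpy as np


def fda_vectors(X, labels):
    """FDA vectors from S_w^{-1} S_b w = lambda w, scaled so that w' S_w w = 1."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    mean_all = X.mean(axis=0)
    m = X.shape[1]
    S_w, S_b = np.zeros((m, m)), np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        S_b += Xc.shape[0] * np.outer(mean_c - mean_all, mean_c - mean_all)
    # With S_w = L L', the generalized problem becomes the symmetric problem
    # (L^-1 S_b L^-T) v = lambda v, and the FDA vector is w = L^-T v.
    L_inv = np.linalg.inv(np.linalg.cholesky(S_w))
    eigvals, V = np.linalg.eigh(L_inv @ S_b @ L_inv.T)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], L_inv.T @ V[:, order], S_w, S_b


rng = np.random.default_rng(9)
# Hypothetical data: p = 3 classes of 60 observations in m = 4 variables.
centers = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 2, 3, 0]], dtype=float)
X = np.vstack([c + rng.normal(size=(60, 4)) for c in centers])
labels = np.repeat([0, 1, 2], 60)
eigvals, W, S_w, S_b = fda_vectors(X, labels)
n_useful = int(np.sum(eigvals > 1e-8 * eigvals[0]))
```

With three classes, at most two eigenvalues are numerically non-zero, so only the first two columns of `W` carry discriminant information.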

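When FDA is used for pattern classification, the dimensionality reduction
technique is implemented for all classes of data at the same time. Denote
$W_a = [w_1, w_2, \ldots, w_a] \in \mathbb{R}^{m \times a}$. The discriminant function can be deduced as

$$g_j(x) = -\frac{1}{2} (x - \bar{x}_j)^{\mathrm{T}} W_a \left( \frac{1}{n_j - 1} W_a^{\mathrm{T}} S_j W_a \right)^{-1} W_a^{\mathrm{T}} (x - \bar{x}_j) + \ln(p_j) - \frac{1}{2} \ln \left[ \det \left( \frac{1}{n_j - 1} W_a^{\mathrm{T}} S_j W_a \right) \right]. \tag{2.29}$$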


FDA can also be used to detect faults by defining an additional class of data on 
top of the fault class, i.e., data collected under normal operating conditions. The 
reliability of fault detection using (2.29) depends on the similarity between the data 
from normal operating conditions and the fault class data in the training set. Fault 
detection using FDA will yield small miss rates for known fault classes when a 
transformation W exists such that data from normal operating conditions can be 
reasonably separated from other fault classes. 
2.2.2 Comparison of FDA and PCA


As two classical techniques for dimensionality reduction of a single data set, PCA
and FDA exhibit similar properties in many aspects. The optimization problems of
PCA and FDA, formulated mathematically in (2.8) and (2.26), respectively, can also
be written as

$$J_{\mathrm{PCA}} = \max_{w \ne 0} \frac{w^{\mathrm{T}} S_t w}{w^{\mathrm{T}} w}, \tag{2.30}$$

$$J_{\mathrm{FDA}} = \max_{w \ne 0} \frac{w^{\mathrm{T}} S_t w}{w^{\mathrm{T}} S_w w}. \tag{2.31}$$

In the special case $S_w = aI$, $a \ne 0$, their vector optimization results are identical.
This would occur if the data for each class could be described by a uniformly
distributed ball (i.e., without a dominant direction), even if these balls had different
sizes. The two techniques differ only when the data used to
describe either class appear elongated. These elongated shapes occur in highly
correlated data sets, for example, the data collected in industrial processes. Thus, when
FDA and PCA are applied to process data in the same way, the FDA vectors and
the PCA loading vectors are significantly different. The different objectives of (2.30)
and (2.31) show that FDA performs better than PCA at distinguishing
among fault classes.

Figure 2.5 illustrates a difference between PCA and FDA. The first FDA vector
and the first PCA loading vector are almost perpendicular. PCA maps the entire data
set onto the coordinate axes that are most convenient for representing the data. The mapping
does not use any classification information inside the data. Therefore, although the
entire data set is more convenient to represent after PCA (reducing the dimensionality
and minimizing the loss of information), it may become more difficult to classify.
The projections of the red and blue classes overlap in the PCA direction but are
separated in the FDA direction. With the FDA mapping, the two sets of data become
easier to distinguish (they can be distinguished in low dimensions, which reduces the
amount of calculation).

To illustrate more clearly the difference between PCA and FDA, the following
numerical example of binary classification is given:

$$x_1 = [5 + 0.05\mu(0, 1);\; 3.2 + 0.9\mu(0, 1)] \in \mathbb{R}^{2 \times N},$$

$$x_2 = [5.1 + 0.05\mu(0, 1);\; 3.2 + 0.9\mu(0, 1)] \in \mathbb{R}^{2 \times N},$$

$$X = [x_1, x_2] \in \mathbb{R}^{2 \times 2N},$$

where $\mu(0, 1) \in \mathbb{R}^{1 \times N}$ is a uniformly distributed random vector on [0, 1] and N is the
number of samples per mode. X is a two-mode data set, and its projection by FDA and PCA
is shown in Fig. 2.6. The distribution
of the data in the classes is somewhat elongated. The linear transformation of the
data on the first FDA vector separates the two types of data better than the linear
transformation of the data on the first PCA loading vector.
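
The example can be reproduced approximately with the sketch below. The number of samples per mode (`N = 100`) and the random seed are assumptions, observations are stored as rows rather than columns, and the two-class FDA direction is computed from the closed form $w \propto S_w^{-1}(\bar{x}_2 - \bar{x}_1)$.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100  # samples per mode (assumed value)
x1 = np.column_stack([5.0 + 0.05 * rng.uniform(size=N), 3.2 + 0.9 * rng.uniform(size=N)])
x2 = np.column_stack([5.1 + 0.05 * rng.uniform(size=N), 3.2 + 0.9 * rng.uniform(size=N)])
X = np.vstack([x1, x2])                          # rows are observations here

m1, m2, m = x1.mean(axis=0), x2.mean(axis=0), X.mean(axis=0)
S_w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
S_b = N * np.outer(m1 - m, m1 - m) + N * np.outer(m2 - m, m2 - m)
S_t = (X - m).T @ (X - m)

w_pca = np.linalg.eigh(S_t)[1][:, -1]            # direction of largest total scatter
w_fda = np.linalg.solve(S_w, m2 - m1)            # two-class FDA direction
w_fda /= np.linalg.norm(w_fda)


def fisher_ratio(w):
    """Between-class over within-class scatter along direction w, as in (2.26)."""
    return (w @ S_b @ w) / (w @ S_w @ w)


j_pca, j_fda = fisher_ratio(w_pca), fisher_ratio(w_fda)
```

The PCA direction follows the elongated second variable, where the two modes overlap, while the FDA direction follows the first variable, where they are separated; accordingly `j_fda` is much larger than `j_pca`.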
Fig. 2.6 Two-dimensional data projection comparison of FDA and PCA


Both PCA and FDA can be used to classify the original data after dimensionality
reduction. PCA is an unsupervised method, i.e., it uses no classification labels.
After dimensionality reduction, unsupervised algorithms such as K-means or
self-organizing maps are needed for classification. FDA is a supervised
method. It first reduces the dimensionality of the training data and then finds a linear
discriminant function. The similarities and differences between FDA and PCA can
be summarized as follows.


1. Similarities 


(1) Both methods are used to reduce dimensionality;
(2) Both assume Gaussian distribution.


2. Differences


(1) FDA is a supervised dimensionality reduction method, while PCA is unsupervised;

(2) FDA can reduce the dimensionality to at most k − 1, where k is the number of
categories, while PCA does not have this restriction;

(3) FDA depends more on the mean. If the sample information depends more
on the variance, FDA will not perform as well as PCA;

(4) FDA may overfit the data.
Chapter 3
Multivariate Statistics Between Two-Observation Spaces


As mentioned in the previous chapter, industrial data are usually divided into two 
categories, process data and quality data, belonging to different measurement spaces. 
The vast majority of smart manufacturing problems, such as soft measurement, con- 
trol, monitoring, optimization, etc., inevitably require modeling the data relationships 
between the two kinds of measurement variables. This chapter’s subject is to discover 
the correlation between the sets in different observation spaces. 

The multivariate statistical analysis methods relying on the correlation among variables
generally include canonical correlation analysis (CCA) and partial least squares regression
(PLS). Both perform linear dimensionality reduction with the goal of maximizing
the association between variables in two measurement spaces. The difference is that
CCA maximizes correlation, while PLS maximizes covariance.


3.1 Canonical Correlation Analysis 


Canonical correlation analysis (CCA) was first proposed by Hotelling in 1936 
(Hotelling 1936). It is a multivariate statistical analysis method that uses the cor- 
relation between two composite variables to reflect the overall correlation between 
two sets of variables. The CCA algorithm is widely used in the analysis of data cor- 
relation and it is also the basis of partial least squares. In addition, it is also used in 
feature fusion, data dimensionality reduction, and fault detection (Yang et al. 2015; 
Zhang and Dou 2015; Zhang et al. 2020; Hou 2013; Chen et al. 2016a, b). 
3.1.1 Mathematical Principle of CCA


Assume that there are l dependent variables $y = (y_1, y_2, \ldots, y_l)^{\mathrm{T}}$ and m independent
variables $x = (x_1, x_2, \ldots, x_m)^{\mathrm{T}}$. In order to capture the correlation between the
dependent variables and the independent variables, n sample points are observed,
which constitute two data sets:

$$X = [x(1), x(2), \ldots, x(n)]^{\mathrm{T}} \in \mathbb{R}^{n \times m},$$

$$Y = [y(1), y(2), \ldots, y(n)]^{\mathrm{T}} \in \mathbb{R}^{n \times l}.$$

CCA draws on the idea of component extraction to find a canonical component u,
which is a linear combination of the variables $x_i$, and a canonical component v, which
is a linear combination of the variables $y_i$. In the process of extraction, the correlation
between u and v is required to be maximized. The degree of correlation between u and v can
roughly reflect the correlation between X and Y.

Without loss of generality, assume that the original variables are all standardized,
i.e., each column of the data sets X and Y has mean 0 and variance 1. Then the covariance
matrix cov(X, Y) is equal to its correlation coefficient matrix, in which

$$\mathrm{cov}(X, Y) = \frac{1}{n} \begin{bmatrix} X^{\mathrm{T}} X & X^{\mathrm{T}} Y \\ Y^{\mathrm{T}} X & Y^{\mathrm{T}} Y \end{bmatrix} = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}^{\mathrm{T}} & \Sigma_{yy} \end{bmatrix}.$$

PCA analyzes $\Sigma_{xx}$ or $\Sigma_{yy}$, while CCA analyzes $\Sigma_{xy}$.
Now the problem is how to find the direction vectors $\alpha$ and $\beta$, and then use them
to construct the canonical components:

$$\begin{aligned} u &= \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_m x_m, \\ v &= \beta_1 y_1 + \beta_2 y_2 + \cdots + \beta_l y_l, \end{aligned} \tag{3.1}$$

where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^{\mathrm{T}} \in \mathbb{R}^{m \times 1}$ and $\beta = [\beta_1, \beta_2, \ldots, \beta_l]^{\mathrm{T}} \in \mathbb{R}^{l \times 1}$, such that the
correlation between u and v is maximized. Obviously, the sample means of u and v
are zero, and their sample variances are as follows:

$$\mathrm{var}(u) = \alpha^{\mathrm{T}} \Sigma_{xx} \alpha, \qquad \mathrm{var}(v) = \beta^{\mathrm{T}} \Sigma_{yy} \beta.$$

The covariance of u and v is

$$\mathrm{cov}(u, v) = \alpha^{\mathrm{T}} \Sigma_{xy} \beta.$$


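One way to maximize the correlation of u and v is to maximize the corresponding
correlation coefficient, i.e.,

$$\max \rho(u, v) = \frac{\mathrm{cov}(u, v)}{\sqrt{\mathrm{var}(u)\,\mathrm{var}(v)}}. \tag{3.2}$$

In CCA, the following optimization objective is used:

$$\begin{aligned} J_{\mathrm{CCA}} &= \max \langle u, v \rangle = \alpha^{\mathrm{T}} \Sigma_{xy} \beta \\ \text{s.t.} \quad & \alpha^{\mathrm{T}} \Sigma_{xx} \alpha = 1; \quad \beta^{\mathrm{T}} \Sigma_{yy} \beta = 1. \end{aligned} \tag{3.3}$$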


This optimization objective can be summarized as follows: seek a unit vector $\alpha$ on
the subspace of X and a unit vector $\beta$ on the subspace of Y such that the correlation
between u and v is maximized. Geometrically, $\rho(u, v)$ is equal to the cosine
of the angle between u and v. Thus, (3.3) is equivalent to making the angle $\omega$
between u and v take its minimum value.

It can be seen from (3.3) that the goal of the CCA algorithm is finally transformed
into a constrained optimization problem. The maximum value of this objective is
the first canonical correlation coefficient of X and Y, and the corresponding $\alpha$ and
$\beta$ are the projection vectors, or linear coefficients. After the first pair of canonical
correlation variables is obtained, the second to kth pairs of canonical correlation
variables, which are uncorrelated with the earlier pairs, can be calculated similarly.

Figure 3.1 shows the basic principle diagram of the CCA algorithm.

At present, there are two main methods for optimizing the above objective function
to obtain $\alpha$ and $\beta$: eigenvalue decomposition and singular value decomposition.

Fig. 3.1 Basic principle diagram of the CCA algorithm

3.1.2 Eigenvalue Decomposition of CCA Algorithm


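Using the Lagrangian function, the objective function of (3.3) is transformed as
follows:

$$
\max J_{CCA}(\alpha, \beta) = \alpha^{T} \Sigma_{xy} \beta
- \frac{\lambda_1}{2}\left(\alpha^{T} \Sigma_{xx} \alpha - 1\right)
- \frac{\lambda_2}{2}\left(\beta^{T} \Sigma_{yy} \beta - 1\right).
\tag{3.4}
$$

Set $\partial J / \partial \alpha = 0$ and $\partial J / \partial \beta = 0$, then

$$
\begin{aligned}
\Sigma_{xy} \beta - \lambda_1 \Sigma_{xx} \alpha &= 0 \\
\Sigma_{yx} \alpha - \lambda_2 \Sigma_{yy} \beta &= 0.
\end{aligned}
\tag{3.5}
$$

Left-multiplying the two equations of (3.5) by $\alpha^{T}$ and $\beta^{T}$, respectively,
and using the constraints of (3.3) gives $\lambda_1 = \lambda_2 = \alpha^{T} \Sigma_{xy} \beta$.
Let $\lambda = \lambda_1 = \lambda_2$, and multiply (3.5) on the left by $\Sigma_{xx}^{-1}$ and
$\Sigma_{yy}^{-1}$, respectively, to get

$$
\begin{aligned}
\Sigma_{xx}^{-1} \Sigma_{xy} \beta &= \lambda \alpha \\
\Sigma_{yy}^{-1} \Sigma_{yx} \alpha &= \lambda \beta.
\end{aligned}
\tag{3.6}
$$

Substituting the second formula in (3.6) into the first formula, we get

$$
\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} \alpha = \lambda^{2} \alpha.
\tag{3.7}
$$

From (3.7), the largest eigenvalue $\lambda^{2}$ and the corresponding eigenvector $\alpha$
are obtained by eigenvalue decomposition of the matrix
$\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$. In a similar way, the vector $\beta$
can be obtained. At this point, the projection vectors $\alpha$ and $\beta$ of a pair of
canonical correlation variables are available.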
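The eigenvalue route of (3.5)-(3.7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `cca_eig` is hypothetical, samples are assumed to be stored in rows, and the covariance matrices are assumed to be normalized by n - 1 (the normalization does not change the resulting correlation).

```python
import numpy as np

def cca_eig(X, Y):
    """Leading canonical pair from the eigenproblem (3.7).

    X: (n, m) and Y: (n, l) data matrices with samples in rows (assumption).
    Returns (rho, alpha, beta).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1)
    Syy = Y.T @ Y / (n - 1)
    Sxy = X.T @ Y / (n - 1)

    # Matrix of (3.7): Sxx^{-1} Sxy Syy^{-1} Syx (not symmetric in general)
    A = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    rho = np.sqrt(eigvals.real[k])            # the eigenvalue of (3.7) is lambda^2
    alpha = eigvecs[:, k].real
    alpha = alpha / np.sqrt(alpha @ Sxx @ alpha)          # alpha' Sxx alpha = 1
    beta = np.linalg.solve(Syy, Sxy.T @ alpha) / rho      # second line of (3.6)
    return rho, alpha, beta
```

With these outputs, the sample correlation between `X @ alpha` and `Y @ beta` (after centering) should equal `rho`, which is one way to check the result numerically.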


3.1.3 SVD Solution of CCA Algorithm 


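Let $\alpha = \Sigma_{xx}^{-1/2} a$ and $\beta = \Sigma_{yy}^{-1/2} b$; then we get

$$
\begin{aligned}
\alpha^{T} \Sigma_{xx} \alpha = 1 &\;\Rightarrow\; a^{T} \Sigma_{xx}^{-1/2} \Sigma_{xx} \Sigma_{xx}^{-1/2} a = 1 \;\Rightarrow\; a^{T} a = 1 \\
\beta^{T} \Sigma_{yy} \beta = 1 &\;\Rightarrow\; b^{T} \Sigma_{yy}^{-1/2} \Sigma_{yy} \Sigma_{yy}^{-1/2} b = 1 \;\Rightarrow\; b^{T} b = 1 \\
\alpha^{T} \Sigma_{xy} \beta &= a^{T} \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} b.
\end{aligned}
\tag{3.8}
$$

In other words, the objective function of (3.3) can be transformed as follows:

$$
\begin{aligned}
J_{CCA}(a, b) &= \max\; a^{T} \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} b \\
\text{s.t.}\quad & a^{T} a = b^{T} b = 1.
\end{aligned}
\tag{3.9}
$$

A singular value decomposition of the matrix M yields

$$
M = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} = \Gamma \Lambda \Psi^{T},
\qquad
\Lambda = \begin{bmatrix} \Lambda_{\kappa} & 0 \\ 0 & 0 \end{bmatrix},
\tag{3.10}
$$

where $\kappa$ is the number of principal elements or non-zero singular values,
$\kappa \le \min(l, m)$, $\Lambda_{\kappa} = \mathrm{diag}(\lambda_1, \ldots, \lambda_{\kappa})$, and
$\lambda_1 \ge \cdots \ge \lambda_{\kappa} > 0$.

Since the columns of $\Gamma$ and $\Psi$ form orthonormal bases, when a and b are chosen
as the ith columns of $\Gamma$ and $\Psi$, $a^{T}\Gamma$ and $\Psi^{T} b$ are vectors with a
single entry equal to 1 and all remaining entries equal to 0. So, we get

$$
a^{T} \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} b = a^{T} \Gamma \Lambda \Psi^{T} b = \lambda_i.
\tag{3.11}
$$

From (3.11), it can be seen that $a^{T} \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} b$ is
maximized by the left and right singular vectors corresponding to the maximum singular
value of M. Thus, using the corresponding left and right singular vectors in $\Gamma$ and
$\Psi$, we obtain the projection vectors $\alpha$ and $\beta$ for a pair of canonical
correlation variables, namely,

$$
\begin{aligned}
\alpha &= \Sigma_{xx}^{-1/2} a \\
\beta &= \Sigma_{yy}^{-1/2} b.
\end{aligned}
\tag{3.12}
$$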
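The SVD route of (3.8)-(3.12) can be illustrated with the following minimal NumPy sketch. The helper names `inv_sqrt` and `cca_svd` are hypothetical, samples are assumed to be in rows, the covariance matrices are assumed to be positive definite, and the n - 1 normalization is an assumption of this sketch.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_svd(X, Y):
    """All canonical pairs via the SVD of M in (3.10).

    Returns (lam, J, L): singular values, and matrices whose columns are the
    projection vectors alpha and beta of (3.12).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx_is = inv_sqrt(X.T @ X / (n - 1))
    Syy_is = inv_sqrt(Y.T @ Y / (n - 1))
    M = Sxx_is @ (X.T @ Y / (n - 1)) @ Syy_is
    Gamma, lam, PsiT = np.linalg.svd(M)   # M = Gamma diag(lam) Psi'
    J = Sxx_is @ Gamma                    # alpha = Sxx^{-1/2} a
    L = Syy_is @ PsiT.T                   # beta  = Syy^{-1/2} b
    return lam, J, L
```

Because the symmetric square roots are formed explicitly, this variant returns every canonical pair at once, with the singular values sorted in descending order by `np.linalg.svd`.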


3.1.4 CCA-Based Fault Detection 


When there is a clear input-output relationship between the two types of data mea- 
surable online, CCA can be used to design an effective fault detection system. The 
CCA-based fault detection method can be considered as an alternative to PCA-based 
fault detection method, and an extension of PLS-based fault detection method (Chen 
et al. 2016a). 

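Let

$$
\begin{aligned}
J_s &= \Sigma_{xx}^{-1/2}\, \Gamma(:, 1:\kappa) \\
L_s &= \Sigma_{yy}^{-1/2}\, \Psi(:, 1:\kappa) \\
J_{res} &= \Sigma_{xx}^{-1/2}\, \Gamma(:, \kappa+1:m) \\
L_{res} &= \Sigma_{yy}^{-1/2}\, \Psi(:, \kappa+1:l).
\end{aligned}
$$

According to the CCA method, $J_s^{T} x$ and $L_s^{T} y$ are closely related. However, in
actual systems, measurement variables are inevitably affected by noise, and the
correlation between $J_s^{T} x$ and $L_s^{T} y$ can be expressed as

$$
L_s^{T} y(k) = \Lambda_{\kappa}^{T} J_s^{T} x(k) + v_s(k),
\tag{3.13}
$$

where $v_s$ is the noise term, which is only weakly correlated with $J_s^{T} x$. Based on
this, the residual vector is

$$
r_1(k) = L_s^{T} y(k) - \Lambda_{\kappa}^{T} J_s^{T} x(k).
\tag{3.14}
$$

Assume that the input and output data obey the Gaussian distribution. Since a linear
transformation of Gaussian random variables is again Gaussian, the residual signal
$r_1$ also obeys the Gaussian distribution, and its covariance matrix is

$$
\Sigma_{r_1} = \frac{1}{N-1}\left(L_s^{T} Y - \Lambda_{\kappa}^{T} J_s^{T} X\right)\left(L_s^{T} Y - \Lambda_{\kappa}^{T} J_s^{T} X\right)^{T}
= \frac{I_{\kappa} - \Lambda_{\kappa}^{2}}{N-1},
\tag{3.15}
$$

where X and Y here hold the N samples as columns. Similarly, another residual vector
can be obtained:

$$
r_2(k) = J_s^{T} x(k) - \Lambda_{\kappa} L_s^{T} y(k).
\tag{3.16}
$$

Its covariance matrix is

$$
\Sigma_{r_2} = \frac{1}{N-1}\left(J_s^{T} X - \Lambda_{\kappa} L_s^{T} Y\right)\left(J_s^{T} X - \Lambda_{\kappa} L_s^{T} Y\right)^{T}
= \frac{I_{\kappa} - \Lambda_{\kappa}^{2}}{N-1}.
\tag{3.17}
$$

It can be seen from (3.15) and (3.17) that the covariance matrices of the residuals
$r_1$ and $r_2$ are the same. For fault detection, the following two statistics can be
constructed:

$$
T_1^{2}(k) = (N-1)\, r_1^{T}(k) \left(I_{\kappa} - \Lambda_{\kappa}^{2}\right)^{-1} r_1(k)
\tag{3.18}
$$

$$
T_2^{2}(k) = (N-1)\, r_2^{T}(k) \left(I_{\kappa} - \Lambda_{\kappa}^{2}\right)^{-1} r_2(k).
\tag{3.19}
$$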
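As a rough illustration of the residual statistic, the sketch below fits the CCA quantities on normal-operation data and evaluates $T_1^2$ for a single sample. It is a sketch under stated assumptions, not the book's code: `fit_cca_detector` and `t2_residual` are hypothetical names, samples are stored in rows, and the covariance matrices are normalized by N - 1. Under that normalization the residual covariance is $I_\kappa - \Lambda_\kappa^2$, so the factor (N - 1) of (3.18) is already absorbed. The statistic $T_2^2$ follows the same pattern with the roles of x and y exchanged.

```python
import numpy as np

def _inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def fit_cca_detector(X, Y, kappa):
    """X: (N, m) inputs, Y: (N, l) outputs from normal operation, rows = samples."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    N = X.shape[0]
    Sxx_is = _inv_sqrt(Xc.T @ Xc / (N - 1))
    Syy_is = _inv_sqrt(Yc.T @ Yc / (N - 1))
    Gamma, lam, PsiT = np.linalg.svd(Sxx_is @ (Xc.T @ Yc / (N - 1)) @ Syy_is)
    return {
        "mx": mx, "my": my,
        "J": Sxx_is @ Gamma[:, :kappa],     # J_s
        "L": Syy_is @ PsiT.T[:, :kappa],    # L_s
        "lam": lam[:kappa],                 # diagonal of Lambda_kappa
    }

def t2_residual(model, x, y):
    """T_1^2 for one sample, built from the residual r_1 of (3.14)."""
    r1 = (model["L"].T @ (y - model["my"])
          - model["lam"] * (model["J"].T @ (x - model["mx"])))
    return float(np.sum(r1 ** 2 / (1.0 - model["lam"] ** 2)))
```

A sample would then be flagged when `t2_residual` exceeds a threshold; how that threshold is chosen (for example from a chi-square quantile with $\kappa$ degrees of freedom or from an empirical quantile of the training statistics) is left open here.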


3.2 Partial Least Squares 


Multiple linear regression analysis is widely used, and the least squares method
is generally used to estimate the regression coefficients in this type of regression.
However, least squares often fails when the independent variables are strongly
correlated with each other (multicollinearity) or when the number of samples is smaller
than the number of variables. The partial least squares (PLS) technique was developed
to resolve this problem. S. Wold, C. Albano, et al. first proposed the partial least
squares method and applied it to the field of chemistry (Wold et al. 1989). It aims at
the regression modeling between two sets of highly correlated multivariate data
and integrates the basic functions of multiple linear regression analysis, principal
component analysis, and canonical correlation analysis. PLS is also called the
second-generation regression analysis method because it simplifies the model of the
data structure and correlation (Hair et al. 2016). It has developed rapidly and has been
widely used in various fields in recent years (Okwuashi et al. 2020; Ramin et al. 2018).


3.2.1 Fundamental of PLS 


Suppose there are $l$ dependent variables $(y_1, y_2, \ldots, y_l)$ and $m$ independent
variables $(x_1, x_2, \ldots, x_m)$. In order to study the statistical relationship between
the dependent variables and the independent variables, n sample points are observed,
which constitute a data set ($X = [x_1, x_2, \ldots, x_m] \in R^{n \times m}$,
$Y = [y_1, y_2, \ldots, y_l] \in R^{n \times l}$) of the independent variables and the
dependent variables.
To address the problems encountered in least squares multiple regression between
X and Y, the concept of component extraction is introduced in PLS regression
analysis. Recall that principal component analysis, for a single data matrix X, finds
the composite variable that best summarizes the information in the original data. The
principal component T in X is extracted with the maximum variance information of
the original data:

$$
\max \mathrm{var}(T).
\tag{3.20}
$$

PLS extracts component vectors $t_1$ and $u_1$ from X and Y, which means $t_1$ is a linear
combination of $(x_1, x_2, \ldots, x_m)$, and $u_1$ is a linear combination of
$(y_1, y_2, \ldots, y_l)$. During the extraction of components, in order to meet the needs
of regression analysis, the following two requirements should be satisfied:

(1) $t_1$ and $u_1$ each carry as much of the variation information in their respective
data sets as possible;
(2) The correlation between $t_1$ and $u_1$ is maximized.

The two requirements indicate that $t_1$ and $u_1$ should represent the data sets X and
Y as well as possible, and that the component $t_1$ of the independent variables has the
best ability to explain the component $u_1$ of the dependent variables.


3.2.2 PLS Algorithm 


The most popular algorithm used in PLS to compute the vectors in the calibration
step is known as nonlinear iterative partial least squares (NIPALS). First, normalize
the data to facilitate the calculations. Normalizing X gives the matrix $E_0$, and
normalizing Y gives the matrix $F_0$:

$$
E_0 = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix},
\qquad
F_0 = \begin{bmatrix} y_{11} & \cdots & y_{1l} \\ \vdots & & \vdots \\ y_{n1} & \cdots & y_{nl} \end{bmatrix}.
\tag{3.21}
$$

In the first step, let $t_1$ ($t_1 = E_0 w_1$) be the first component of $E_0$, where $w_1$
is the first direction vector of $E_0$ and is a unit vector, $\|w_1\| = 1$. Similarly, let
$u_1$ ($u_1 = F_0 c_1$) be the first component of $F_0$, where $c_1$ is the first direction
vector of $F_0$ and is also a unit vector, $\|c_1\| = 1$.


According to the principle of principal component analysis, $t_1$ and $u_1$ should meet
the following conditions in order to represent the data variation information
in X and Y well:

$$
\begin{aligned}
&\max \mathrm{var}(t_1) \\
&\max \mathrm{var}(u_1).
\end{aligned}
\tag{3.22}
$$

On the other hand, $t_1$ is further required to have the best explanatory ability for
$u_1$ due to the needs of regression modeling. Following the idea of canonical
correlation analysis, the correlation between $t_1$ and $u_1$ should reach its maximum
value:

$$
\max r(t_1, u_1).
\tag{3.23}
$$

The covariance of $t_1$ and $u_1$ is usually used to describe the correlation in partial
least squares regression:

$$
\max \mathrm{Cov}(t_1, u_1) = \sqrt{\mathrm{var}(t_1)\,\mathrm{var}(u_1)}\; r(t_1, u_1).
\tag{3.24}
$$

Written as a formal mathematical expression, $t_1$ and $u_1$ are obtained from the
following optimization problem:

$$
\begin{aligned}
\max_{w_1, c_1}\; & \langle E_0 w_1, F_0 c_1 \rangle \\
\text{s.t.}\quad & w_1^{T} w_1 = 1 \\
& c_1^{T} c_1 = 1.
\end{aligned}
\tag{3.25}
$$

Therefore, the maximum value of $w_1^{T} E_0^{T} F_0 c_1$ needs to be calculated under the
constraints $\|w_1\|^{2} = 1$ and $\|c_1\|^{2} = 1$.
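In this case, the Lagrangian function is

$$
s = w_1^{T} E_0^{T} F_0 c_1 - \lambda_1 \left(w_1^{T} w_1 - 1\right) - \lambda_2 \left(c_1^{T} c_1 - 1\right).
\tag{3.26}
$$

Calculate the partial derivatives of s with respect to $w_1$, $c_1$, $\lambda_1$, and
$\lambda_2$, and set them to zero:

$$
\frac{\partial s}{\partial w_1} = E_0^{T} F_0 c_1 - 2\lambda_1 w_1 = 0,
\tag{3.27}
$$

$$
\frac{\partial s}{\partial c_1} = F_0^{T} E_0 w_1 - 2\lambda_2 c_1 = 0,
\tag{3.28}
$$

$$
\frac{\partial s}{\partial \lambda_1} = -\left(w_1^{T} w_1 - 1\right) = 0,
\tag{3.29}
$$

$$
\frac{\partial s}{\partial \lambda_2} = -\left(c_1^{T} c_1 - 1\right) = 0.
\tag{3.30}
$$

It can be derived from the above formulas that

$$
2\lambda_1 = 2\lambda_2 = w_1^{T} E_0^{T} F_0 c_1 = \langle E_0 w_1, F_0 c_1 \rangle.
\tag{3.31}
$$

Let $\theta_1 = 2\lambda_1 = 2\lambda_2 = w_1^{T} E_0^{T} F_0 c_1$, so $\theta_1$ is the value of
the objective function of the optimization problem (3.25). Then (3.27) and (3.28) are
rewritten as

$$
E_0^{T} F_0 c_1 = \theta_1 w_1,
\tag{3.32}
$$

$$
F_0^{T} E_0 w_1 = \theta_1 c_1.
\tag{3.33}
$$

Substituting (3.33) into (3.32) gives

$$
E_0^{T} F_0 F_0^{T} E_0 w_1 = \theta_1^{2} w_1.
\tag{3.34}
$$

Likewise, substituting (3.32) into (3.33) gives

$$
F_0^{T} E_0 E_0^{T} F_0 c_1 = \theta_1^{2} c_1.
\tag{3.35}
$$

Equation (3.34) shows that $w_1$ is an eigenvector of the matrix $E_0^{T} F_0 F_0^{T} E_0$
with the corresponding eigenvalue $\theta_1^{2}$. Here, $\theta_1$ is the objective function.
To obtain its maximum value, $w_1$ should be the unit eigenvector of the maximum
eigenvalue of the matrix $E_0^{T} F_0 F_0^{T} E_0$. Similarly, $c_1$ should be the unit
eigenvector of the largest eigenvalue of the matrix $F_0^{T} E_0 E_0^{T} F_0$.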
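The eigenvector characterization in (3.34) and (3.35) can be checked numerically. The sketch below relies on a standard linear-algebra fact: the leading left and right singular vectors of $E_0^{T} F_0$ are the unit eigenvectors of $E_0^{T} F_0 F_0^{T} E_0$ and $F_0^{T} E_0 E_0^{T} F_0$ for the largest eigenvalue, and the leading singular value equals $\theta_1$. The function name `first_pls_directions` is hypothetical.

```python
import numpy as np

def first_pls_directions(E0, F0):
    """Unit vectors w1, c1 maximizing w1' E0' F0 c1, and the optimum theta1."""
    U, s, Vt = np.linalg.svd(E0.T @ F0, full_matrices=False)
    w1 = U[:, 0]     # eigenvector of E0' F0 F0' E0 for eigenvalue theta1^2
    c1 = Vt[0, :]    # eigenvector of F0' E0 E0' F0 for eigenvalue theta1^2
    theta1 = s[0]
    return w1, c1, theta1
```

Using the SVD avoids forming the two product matrices explicitly and returns $w_1$ and $c_1$ with consistent signs, so that $w_1^{T} E_0^{T} F_0 c_1$ is positive.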

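Then the first components $t_1$ and $u_1$ are calculated from the direction vectors
$w_1$ and $c_1$:

$$
\begin{aligned}
t_1 &= E_0 w_1 \\
u_1 &= F_0 c_1.
\end{aligned}
\tag{3.36}
$$

The regression equations of $E_0$ and $F_0$ on $t_1$ and $u_1$ are

$$
\begin{aligned}
E_0 &= t_1 p_1^{T} + E_1 \\
F_0 &= u_1 q_1^{T} + F_1^{*} \\
F_0 &= t_1 r_1^{T} + F_1.
\end{aligned}
\tag{3.37}
$$

The regression coefficient vectors in (3.37) are

$$
\begin{aligned}
p_1 &= \frac{E_0^{T} t_1}{\|t_1\|^{2}} \\
q_1 &= \frac{F_0^{T} u_1}{\|u_1\|^{2}} \\
r_1 &= \frac{F_0^{T} t_1}{\|t_1\|^{2}}.
\end{aligned}
\tag{3.38}
$$

$E_1$, $F_1^{*}$, and $F_1$ are the residual matrices of the three regression equations.
The second step is to replace $E_0$ and $F_0$ with the residual matrices $E_1$ and $F_1$,
respectively. Then find the second pair of direction vectors $w_2$, $c_2$, and the second
pair of components $t_2$ and $u_2$:

$$
\begin{aligned}
t_2 &= E_1 w_2 \\
u_2 &= F_1 c_2 \\
\theta_2 &= w_2^{T} E_1^{T} F_1 c_2.
\end{aligned}
\tag{3.39}
$$

Similarly, $w_2$ is the unit eigenvector corresponding to the largest eigenvalue of the
matrix $E_1^{T} F_1 F_1^{T} E_1$, and $c_2$ is the unit eigenvector of the largest eigenvalue
of the matrix $F_1^{T} E_1 E_1^{T} F_1$. Calculate the regression coefficients

$$
\begin{aligned}
p_2 &= \frac{E_1^{T} t_2}{\|t_2\|^{2}} \\
r_2 &= \frac{F_1^{T} t_2}{\|t_2\|^{2}}.
\end{aligned}
\tag{3.40}
$$

The regression equations are updated:

$$
\begin{aligned}
E_1 &= t_2 p_2^{T} + E_2 \\
F_1 &= t_2 r_2^{T} + F_2.
\end{aligned}
\tag{3.41}
$$

Repeat the calculation according to the above steps. If the rank of X is R, the
regression equations can be obtained:

$$
\begin{aligned}
E_0 &= t_1 p_1^{T} + \cdots + t_R p_R^{T} \\
F_0 &= t_1 r_1^{T} + \cdots + t_R r_R^{T} + F_R.
\end{aligned}
\tag{3.42}
$$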
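The extraction and deflation steps (3.36)-(3.42) can be sketched as a short loop. This is a minimal illustration, assuming samples in rows and standardization with the sample standard deviation; the function name `pls_fit` and its return layout are hypothetical. Each direction $w_h$ is taken as the leading left singular vector of $E_{h-1}^{T} F_{h-1}$, which coincides with the unit eigenvector described above.

```python
import numpy as np

def pls_fit(X, Y, a):
    """Extract `a` PLS components by repeated deflation.

    Returns T (n, a) scores, P (m, a) and R (l, a) regression coefficient
    vectors as columns, W (m, a) direction vectors, and the residual F_a.
    """
    E = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # E_0
    F = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)   # F_0
    T, P, R, W = [], [], [], []
    for _ in range(a):
        U, _, _ = np.linalg.svd(E.T @ F, full_matrices=False)
        w = U[:, 0]                  # unit eigenvector of E'FF'E
        t = E @ w                    # component, as in (3.36)
        p = E.T @ t / (t @ t)        # coefficients, as in (3.38)
        r = F.T @ t / (t @ t)
        E = E - np.outer(t, p)       # residuals, as in (3.37) and (3.41)
        F = F - np.outer(t, r)
        T.append(t); P.append(p); R.append(r); W.append(w)
    return (np.column_stack(T), np.column_stack(P),
            np.column_stack(R), np.column_stack(W), F)
```

By construction, `T @ R.T + F_a` reproduces the standardized output matrix, which mirrors the second line of (3.42) truncated at `a` components.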

If the number of feature vectors used in the PLS modeling is large enough, the
residuals could be zero. In general, it is only necessary to select a (a < R) components
among them to form a regression model with better prediction. The number of principal
components required for modeling is determined by cross-validation, discussed
in Sect. 3.2.3. Once the appropriate number of components is determined, the external
relationship of the input variable matrix X is written as

$$
X = T P^{T} + \tilde{X} = \sum_{h=1}^{a} t_h p_h^{T} + \tilde{X}.
\tag{3.43}
$$

The external relationship of the output variable matrix Y can be written as

$$
Y = U Q^{T} + \tilde{Y} = \sum_{h=1}^{a} u_h q_h^{T} + \tilde{Y}.
\tag{3.44}
$$

The internal relationship is expressed as

$$
\hat{u}_h = b_h t_h, \qquad b_h = \frac{t_h^{T} u_h}{t_h^{T} t_h}.
\tag{3.45}
$$


3.2.3 Cross-Validation Test 


In many cases, the PLS equation does not require the selection of all principal com- 
ponents for regression modeling, but rather, as in principal component analysis, the 
first d(d < l) principal components can be selected in a truncated manner, and a 
better predictive model can be obtained using only these d principal components. 
In fact, if the subsequent principal components no longer provide more meaningful 
information to explain the dependent variable, using too many principal components 
will only undermine the understanding of the statistical trend and lead to wrong 
prediction conclusions. The number of principal components required for modeling 
can be determined by cross-validation. 

Cross-validation is used to prevent over-fitting caused by an overly complex model.
Sometimes referred to as rotation estimation, it is a statistical method that partitions
the data sample into smaller subsets. The analysis is first carried out on one subset,
while another subset is used for subsequent confirmation and validation of this
analysis. The subset used for analysis is called the training set. The other subset is
called the validation set and is generally kept separate from the testing set. Two
cross-validation methods often used in practice are K-fold cross-validation (K-CV)
and leave-one-out cross-validation (LOO-CV).

K-CV divides the original data into K groups (generally of equal size) and uses each
subset once as the validation set, with the remaining K − 1 subsets as the training set,
so K-CV results in K models. In general, K is selected between 5 and 10. LOO-CV is
essentially N-CV, where N is the number of samples. The process of determining
the number of principal components is described in detail below using LOO-CV as
an example.
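The K-CV partitioning described above can be sketched as a small index generator. The function name `k_fold_indices` and the use of a seeded random permutation are assumptions of this sketch; setting K equal to the number of samples gives LOO-CV.

```python
import numpy as np

def k_fold_indices(n, K, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    perm = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(perm, K)     # K groups of (almost) equal size
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train, val
```

Each yielded pair would be used to fit one model on `train_idx` and evaluate it on `val_idx`, giving K models in total.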
All n samples are divided into two parts. The first part is the set of all samples
excluding a certain sample i (containing n − 1 samples in total), and a regression
equation with d principal components is fitted to this data set. The second part is
the excluded ith sample, which is substituted into the fitted regression equation
to obtain the predicted value $\hat{y}_{(i)j}(d)$, $j = 1, 2, \ldots, l$, of $y_j$. Repeating
the above test for each $i = 1, 2, \ldots, n$, the sum of squared prediction errors for
$y_j$ is defined as $PRESS_j(d)$:

$$
PRESS_j(d) = \sum_{i=1}^{n} \left(y_{ij} - \hat{y}_{(i)j}(d)\right)^{2}, \qquad j = 1, 2, \ldots, l.
\tag{3.46}
$$

The sum of squared prediction errors of $Y = (y_1, \ldots, y_l)^{T}$ is obtained as

$$
PRESS(d) = \sum_{j=1}^{l} PRESS_j(d).
\tag{3.47}
$$

If the regression equation is not robust, it is very sensitive to changes in the
samples, and the effect of this perturbation error increases the PRESS(d) value.

On the other hand, use all sample points to fit a regression equation containing d
components. In this case, the fitted value of the ith sample point is $\hat{y}_{ij}(d)$.
The fitted error sum of squares for $y_j$ is defined as $SS_j(d)$:

$$
SS_j(d) = \sum_{i=1}^{n} \left(y_{ij} - \hat{y}_{ij}(d)\right)^{2}.
\tag{3.48}
$$

The sum of squared errors of Y is

$$
SS(d) = \sum_{j=1}^{l} SS_j(d).
\tag{3.49}
$$

Generally, PRESS(d) is greater than SS(d) because PRESS(d) contains an unknown
perturbation error, and the fitting error decreases as the number of components
increases, i.e., SS(d) is less than SS(d − 1). Next, compare SS(d − 1) and PRESS(d).
SS(d − 1) is the fitting error of the regression equation that is fitted with all samples
using d − 1 components; PRESS(d) contains the perturbation error of the samples but
uses one more component. If the error of the d-component regression equation,
including the perturbation error, is somewhat smaller than the fitting error of the
(d − 1)-component regression equation, adding the component $t_d$ is considered to give
a significant improvement in prediction accuracy. Therefore, the ratio
PRESS(d)/SS(d − 1) is expected to be as small as possible. The usual setting is

$$
\frac{PRESS(d)}{SS(d-1)} \le (1 - 0.05)^{2} = 0.95^{2}.
\tag{3.50}
$$

If $PRESS(d) \le 0.95^{2}\, SS(d-1)$, the addition of the component is considered
beneficial. Conversely, if $PRESS(d) > 0.95^{2}\, SS(d-1)$, the newly added component
is considered to give no significant improvement in reducing the prediction
error of the regression equation.

In practice, the following cross-validation index is used. For each dependent
variable $y_j$, define

$$
Q_{dj}^{2} = 1 - \frac{PRESS_j(d)}{SS_j(d-1)}.
\tag{3.51}
$$

For the full dependent variable Y, the cross-validation index of component $t_d$ is
defined as

$$
Q_{d}^{2} = 1 - \frac{\sum_{j=1}^{l} PRESS_j(d)}{\sum_{j=1}^{l} SS_j(d-1)} = 1 - \frac{PRESS(d)}{SS(d-1)}.
\tag{3.52}
$$

The marginal contribution of component $t_d$ to the predictive accuracy of the
regression model is judged on the following two scales (cross-validation index).

(1) If $Q_{d}^{2} > 1 - 0.95^{2} = 0.0975$, the marginal contribution of component $t_d$ is
significant; and

(2) If, for $k = 1, 2, \ldots, l$, there is at least one k such that $Q_{dk}^{2} > 0.0975$
holds, the addition of component $t_d$ leads to a significant improvement in the
prediction accuracy of at least one dependent variable $y_k$. Therefore it can also
be argued that adding component $t_d$ is clearly beneficial.
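As a small worked example of the two scales, the sketch below evaluates (3.51) and (3.52) from given PRESS and SS values and applies the 0.0975 limit. The function names are hypothetical, and treating the two scales as alternatives (a component is kept if either one holds) is this sketch's reading of the text.

```python
import numpy as np

def q2_indices(press_j, ss_prev_j):
    """press_j[j] = PRESS_j(d), ss_prev_j[j] = SS_j(d-1) for each output y_j."""
    press_j = np.asarray(press_j, dtype=float)
    ss_prev_j = np.asarray(ss_prev_j, dtype=float)
    q2_dj = 1.0 - press_j / ss_prev_j               # (3.51), one value per y_j
    q2_d = 1.0 - press_j.sum() / ss_prev_j.sum()    # (3.52), all outputs together
    return q2_d, q2_dj

def keep_component(press_j, ss_prev_j, limit=1 - 0.95 ** 2):
    """True if component t_d passes scale (1) or scale (2)."""
    q2_d, q2_dj = q2_indices(press_j, ss_prev_j)
    return bool(q2_d > limit or np.any(q2_dj > limit))
```

For instance, with two outputs, `press_j = [8.0, 9.5]` and `ss_prev_j = [10.0, 10.0]` give per-output indices of about 0.2 and 0.05 and an overall index of 0.125, so the component would be kept.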
Chapter 4
Simulation Platform for Fault Diagnosis


The previous chapters have described the mathematical principles and algorithms
of multivariate statistical methods, as well as the monitoring procedures when they
are used for fault diagnosis. In order to validate the effectiveness of data-driven
multivariate statistical analysis methods in the field of fault diagnosis, it is necessary
to conduct the corresponding fault monitoring experiments. This chapter therefore
introduces two simulation platforms, the Tennessee Eastman (TE) process simulation
system and the fed-batch penicillin fermentation process simulation system. They are
widely used as test platforms for process monitoring, fault classification, and fault
identification of industrial processes. The related experiments based on PCA, CCA,
PLS, and FDA are completed on the TE simulation platform.


4.1 Tennessee Eastman Process 


The original TE industrial process control problem was published by Downs and
Vogel in 1993. It serves as an open and challenging test problem for control-related
topics including multivariable controller design, optimization, adaptive and predictive
control, nonlinear control, estimation and identification, process monitoring and
diagnosis, and education. The TE process model is based on an actual chemical process
and has been widely used as a benchmark for control and monitoring research.
Figure 4.1 shows the flow diagram of the TE process with five major units: reactor,
condenser, compressor, vapor-liquid separator, and stripper. Four gaseous
reactants A, C, D, and E are fed to the process. In addition, the feeds contain a small
amount of inert gas B. The final products are three liquids, G, H, and F, where F is
the by-product.

Fig. 4.1 Tennessee Eastman process

A(g) + C(g) + D(g) → G(liq), product G
A(g) + C(g) + E(g) → H(liq), product H
A(g) + E(g) → F(liq), by-product
3D(g) → 2F(liq), by-product

Briefly, the TE process consists of two data modules: the XMV module containing 12
manipulated variables (XMV(1)-XMV(12): x23 to x34), and the XMEAS module
consisting of 22 process measured variables (XMEAS(1)-XMEAS(22): x1 to x22) and 19
component measured variables (XMEAS(23)-XMEAS(41): x35 to x53), as listed in
Tables 4.1 and 4.2.

The code and data sets used in this book can be downloaded from
http://depts.washington.edu/control/LARRY/TE/download.html. The Simulink simulator
allows easy setting of the operation modes, measurement noises, sampling time, and
magnitudes of the faults. It is thus very helpful for data-driven process monitoring
studies. The 21 artificial disturbances (considered as faulty operations in the fault
diagnosis problem) in the TE process are shown in Table 4.3. In general, the entire TE
data consist of a training set and a testing set, and each set includes 22 kinds of data
under different simulation operations. Each kind of data has sampled measurements
of 53 observed variables.

In the data set given in the web link above, d00.dat to d21.dat are training sets, and
d00_te.dat to d21_te.dat are testing sets. d00.dat and d00_te.dat are samples under
the normal operating conditions. The training samples of d00.dat are collected from a
25 h simulation run, and the total number of observations is 500. The d00_te.dat
test samples are collected from a 48 h simulation run, and the total number of
observations is 960. d01.dat to d21.dat (for training) and d01_te.dat to d21_te.dat
(for testing) are sampled with different faults, and the numerical label of each
data set corresponds to the fault type.

All the testing data sets are obtained from a 48 h simulation run with the faults
introduced at 8 h. A total of 960 observations are collected, of which the first 160
observations are in normal operation. It is worth pointing out that the data sets
once generated by Leoand et al. (2001) are widely accepted for process monitoring
and fault diagnosis research. The data sets are smoothed, filtered, and normalized.
The monitored variables are variables x1 to x53.

Table 4.1 Monitoring variables in the TE process (x1 to x34)

| No. | Variable name | Units | No. | Variable name | Units |
|-----|---------------|-------|-----|---------------|-------|
| x1 | A feed (stream 1) | kscmh | x18 | Stripper temperature | °C |
| x2 | D feed (stream 2) | kg h⁻¹ | x19 | Stripper steam flow | kg h⁻¹ |
| x3 | E feed (stream 3) | kg h⁻¹ | x20 | Compressor work | kW |
| x4 | A and C feed (stream 4) | kscmh | x21 | Reactor cooling water outlet temperature | °C |
| x5 | Recycle flow (stream 8) | kscmh | x22 | Condenser cooling water outlet temperature | °C |
| x6 | Reactor feed rate (stream 6) | kscmh | x23 | D feed flow valve (stream 2) | % |
| x7 | Reactor pressure | kPa gauge | x24 | E feed flow valve (stream 3) | % |
| x8 | Reactor level | % | x25 | A feed flow valve (stream 1) | % |
| x9 | Reactor temperature | °C | x26 | A and C feed flow valve (stream 4) | % |
| x10 | Purge rate (stream 9) | kscmh | x27 | Compressor recycle valve | % |
| x11 | Product separator temperature | °C | x28 | Purge valve (stream 9) | % |
| x12 | Product separator level | % | x29 | Separator pot liquid flow valve (stream 10) | % |
| x13 | Product separator pressure | kPa gauge | x30 | Stripper liquid product flow valve (stream 11) | % |
| x14 | Product separator underflow (stream 10) | m³ h⁻¹ | x31 | Stripper steam valve | % |
| x15 | Stripper level | % | x32 | Reactor cooling water flow valve | % |
| x16 | Stripper pressure | kPa gauge | x33 | Condenser cooling water flow valve | % |
| x17 | Stripper underflow (stream 11) | m³ h⁻¹ | x34 | Agitator speed | |

Table 4.2 Monitoring variables in the TE process (x35 to x53)

| No. | Variable name | Stream | No. | Variable name | Stream |
|-----|---------------|--------|-----|---------------|--------|
| x35 | Composition A | 6 | x45 | Composition E | 9 |
| x36 | Composition B | 6 | x46 | Composition F | 9 |
| x37 | Composition C | 6 | x47 | Composition G | 9 |
| x38 | Composition D | 6 | x48 | Composition H | 9 |
| x39 | Composition E | 6 | x49 | Composition D | 11 |
| x40 | Composition F | 6 | x50 | Composition E | 11 |
| x41 | Composition A | 9 | x51 | Composition F | 11 |
| x42 | Composition B | 9 | x52 | Composition G | 11 |
| x43 | Composition C | 9 | x53 | Composition H | 11 |
| x44 | Composition D | 9 | | | |

Table 4.3 Disturbances for the TE process

| IDV | Process variable | Type |
|-----|------------------|------|
| 1 | A/C feed ratio, B composition constant (stream 4) | Step |
| 2 | B composition, A/C feed ratio constant (stream 4) | Step |
| 3 | D feed temperature (stream 2) | Step |
| 4 | Reactor cooling water inlet temperature | Step |
| 5 | Condenser cooling water inlet temperature | Step |
| 6 | A feed loss (stream 1) | Step |
| 7 | C header pressure loss, reduced availability (stream 4) | Step |
| 8 | A, B, C feed composition (stream 4) | Random |
| 9 | D feed temperature (stream 2) | Random |
| 10 | C feed temperature (stream 4) | Random |
| 11 | Reactor cooling water inlet temperature | Random |
| 12 | Condenser cooling water inlet temperature | Random |
| 13 | Reaction kinetics | Slow drift |
| 14 | Reactor cooling water valve | Sticking |
| 15 | Condenser cooling water valve | Sticking |
| 16 | Unknown | Unknown |
| 17 | Unknown | Unknown |
| 18 | Unknown | Unknown |
| 19 | Unknown | Unknown |
| 20 | Unknown | Unknown |
| 21 | Valve position (stream 4) | Constant |


4.2 Fed-Batch Penicillin Fermentation Process 


Fed-batch fermentation processes are widely used in the pharmaceutical industry.
Yield maximization is usually considered the main goal in batch fermentation
processes. Compared with continuous operation, batch operation is characterized by
strong nonlinearity, non-stationary conditions, batch-to-batch variability, and strong
time-varying conditions. These features make the yield difficult to predict. Therefore,
fault detection, classification, and identification for batch/fed-batch processes are
more difficult than for the continuous TE process.

The model of the fed-batch penicillin fermentation process is described by Birol et al.
(2002):

$$
\begin{aligned}
\dot{X} &= f(X, S, C_L, H, T) \\
\dot{S} &= f(X, S, C_L, H, T) \\
\dot{C}_L &= f(X, S, C_L, H, T) \\
\dot{P} &= f(X, S, C_L, H, T, P) \\
\dot{CO}_2 &= f(X, H, T) \\
\dot{H} &= f(X, H, T),
\end{aligned}
$$

where $X$, $S$, $C_L$, $P$, $CO_2$, $H$, and $T$ are the biomass concentration, substrate
concentration, dissolved oxygen concentration, penicillin concentration, carbon dioxide
concentration, hydrogen ion concentration for pH ($[H^{+}]$), and temperature,
respectively. The corresponding detailed mathematical model is given in Birol et al. (2002).

A research group at the Illinois Institute of Technology has developed a
dynamic simulation of penicillin production based on an unstructured model,
PenSim V2.0. This model has been used as a benchmark for statistical process
monitoring studies of batch/fed-batch reaction processes. The flow chart of the
fermentation process is depicted in Fig. 4.2. The fermentation unit consists of a
fermentation reactor and a coil-based heat exchange unit. The pH and temperature are
automatically controlled by two PID controllers that adjust the flow rates of acid/base
and cold/hot water. In the fed-batch operation mode, the glucose substrate is fed
continuously into the fermentation reactor in open-loop operation.

Fourteen variables are considered in the PenSim V2.0 model, as shown in Table 4.4: 5
input variables (1-4, 14) and 9 process variables (5-13). Since variables 11-13 are
not measured online in industry, only 11 variables are monitored here.

Fig. 4.2 Flow chart of the penicillin fermentation process

Table 4.4 Variables in penicillin fermentation process

| No. | Variable |
|-----|----------|
| 1 | Aeration rate (L/h) |
| 2 | Agitator power input (W) |
| 3 | Substrate feed rate (L/h) |
| 4 | Substrate feed temperature (K) |
| 5 | Dissolved oxygen concentration (% saturation) |
| 6 | Culture volume (L) |
| 7 | Carbon dioxide concentration (mmol/L) |
| 8 | pH |
| 9 | Temperature in the bioreactor (K) |
| 10 | Generated heat (kcal/h) |
| 11 | Cooling water flow rate (L/h) |
| 12 | Penicillin concentration |
| 13 | Biomass concentration |
| 14 | Substrate concentration |


4.3 Fault Detection Based on PCA, CCA, and PLS 


This section tests the effectiveness of various multivariate statistical methods on the
TE process. Faults in the standard TE data set are introduced at the 160th sample.
For comparison purposes, the normal operation data d00_te are used to train the
statistical model, and the faulty operation data d01_te to d21_te are used to test the
model and detect faults. In the experiments for the PCA and PLS methods, the process
variable matrix X consists of the process variables (XMEAS(1-22)) and manipulated
variables (XMV(1-11)). XMEAS(35) is used as the quality variable matrix Y for PLS. In
the CCA experiment, the process variables (XMEAS(1-22)) are used as one data set,
and the manipulated variables (XMV(1-11)) as the other data set.

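The fault detection rate (FDR) and false alarm rate (FAR) are defined as follows:

$$
\begin{aligned}
FDR &= \frac{\text{No. of samples}\,(J > J_{th} \mid f \ne 0)}{\text{total samples}\,(f \ne 0)} \times 100 \\
FAR &= \frac{\text{No. of samples}\,(J > J_{th} \mid f = 0)}{\text{total samples}\,(f = 0)} \times 100,
\end{aligned}
\tag{4.1}
$$

where $J$ is the monitoring statistic, $J_{th}$ is its threshold, and $f \ne 0$ indicates
that a fault is present.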
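The two rates in (4.1) reduce to simple counting once the fault onset is known. The sketch below is illustrative: the function name `fdr_far` is hypothetical, and it assumes that all samples before index `fault_start` are fault-free and all later ones are faulty (for the TE test sets described in Sect. 4.1 this would correspond to `fault_start = 160` with zero-based indexing).

```python
import numpy as np

def fdr_far(J, J_th, fault_start):
    """Return (FDR, FAR) in percent for a statistic sequence J and threshold J_th."""
    J = np.asarray(J, dtype=float)
    alarm = J > J_th
    fdr = 100.0 * alarm[fault_start:].mean()   # alarms among faulty samples
    far = 100.0 * alarm[:fault_start].mean()   # alarms among normal samples
    return fdr, far
```

The same function can be applied to each monitoring statistic separately, for example to the $T^2$ and SPE sequences of one method.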


The experiment and model parameters are determined as follows. The number of
principal components of PCA is determined by a cumulative contribution of 90%. The
number of principal components of PLS is selected as 6. The $T^2$ and Q statistics are
used to monitor process faults. It should be noted that in the monitoring of CCA, (3.18)
and (3.19) are used as monitoring indices, and the corresponding monitoring results
are slightly different. For the 21 fault types, the FDRs for PCA, CCA, and PLS based on
the control limit with 99% confidence level are shown in Table 4.5. It can be seen that
the multivariate statistical methods listed in this section (PCA, CCA, and
PLS) can accurately detect the significant process faults.
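The 90% cumulative-contribution rule for PCA can be sketched as follows. This is a minimal illustration: the function name `n_components_by_cpv` is hypothetical, samples are assumed to be in rows, and the data are assumed to be standardized before the eigenvalues of the sample covariance matrix are computed.

```python
import numpy as np

def n_components_by_cpv(X, cpv=0.90):
    """Smallest number of principal components whose cumulative variance
    contribution reaches `cpv`."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]   # descending
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratio, cpv) + 1)
```

For data with two nearly identical columns and one independent column, for example, the first two components already explain almost all of the variance, so the function returns 2.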

Figures 4.3, 4.4, and 4.5 show the monitoring results based on the PCA, CCA,
and PLS models for the typical faults IDV(1), IDV(16), and IDV(20), respectively. Here,
the black line is the statistic calculated from the real-time data and the red line is the
threshold of the normal statistic obtained from the offline model.

Table 4.5 shows that CCA gives better detection for certain fault types,
such as faults IDV(10), IDV(16), IDV(19), and IDV(20). The monitoring results for
faults IDV(16) and IDV(20) are shown in Figs. 4.4 and 4.5. Why does CCA show
better detection capability than the other two methods for certain faults? Consider
how the process variable matrix X is set up in the three methods. In contrast to PCA
and PLS, CCA splits its X-space directly into two parts and extracts the latent variables
by examining the correlation between these two parts, so the latent variables extracted
by CCA can better characterize the changes in the process.


4.4 Fault Classification Based on FDA 


To further test the effectiveness of fault classification, the 161st to the
700th samples of the 21 fault data sets and the normal data set are used for training
the FDA model. The corresponding 701st to 960th samples are used to test the
FDA model and its classification ability. FDA in Sect. 2.2 is a classical method to
validate the classification effect and identify the fault types. The following distance
metric index is introduced to further quantify the difference between different faults:

$$
D_2 = \left\| FDA_i - FDA_j \right\|,
$$

where $FDA_i$ denotes the FDA feature vector of the ith fault.

Table 4.5 FDRs (%) of PCA, CCA and PLS

| IDV | PCA T² | PCA SPE | CCA T₁² | CCA T₂² | PLS T² | PLS SPE |
|-----|--------|---------|---------|---------|--------|---------|
| 1 | 99.13 | 99.88 | 99.38 | 99.63 | 99.75 | 99.38 |
| 2 | 98.38 | 95.13 | 95.63 | 96.13 | 98.63 | 97.75 |
| 3 | 1.00 | 3.00 | 0.25 | 0.50 | 3.75 | 1.88 |
| 4 | 50.88 | 99.88 | 100.00 | 97.38 | 40.63 | 96.88 |
| 5 | 23.75 | 23.88 | 100.00 | 100.00 | 25.50 | 25.88 |
| 6 | 99.00 | 100.00 | 100.00 | 100.00 | 99.25 | 100.00 |
| 7 | 100.00 | 100.00 | 100.00 | 83.00 | 99.13 | 100.00 |
| 8 | 97.00 | 86.25 | 87.00 | 92.25 | 96.88 | 96.75 |
| 9 | 1.50 | 2.00 | 0.13 | 0.13 | 2.13 | 2.25 |
| 10 | 27.88 | 36.13 | 78.75 | 79.38 | 57.00 | 31.25 |
| 11 | 52.50 | 61.63 | 77.00 | 56.88 | 41.88 | 65.75 |
| 12 | 98.38 | 90.25 | 97.00 | 99.00 | 99.00 | 96.75 |
| 13 | 93.75 | 95.13 | 94.38 | 94.25 | 95.50 | 94.25 |
| 14 | 99.88 | 98.88 | 100.00 | 99.88 | 99.88 | 100.00 |
| 15 | 1.25 | 2.00 | 0.63 | 0.75 | 4.50 | 1.13 |
| 16 | 12.13 | 36.25 | 85.00 | 86.63 | 29.75 | 19.25 |
| 17 | 79.50 | 95.88 | 91.38 | 95.25 | 80.13 | 89.75 |
| 18 | 89.13 | 90.50 | 89.50 | 89.50 | 89.50 | 89.50 |
| 19 | 11.63 | 16.50 | 84.38 | 84.25 | 1.63 | 13.38 |
| 20 | 31.13 | 52.75 | 70.38 | 75.50 | 41.75 | 45.38 |
| 21 | 41.25 | 48.75 | 26.63 | 36.88 | 56.38 | 43.00 |
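The distance index can be evaluated for every pair of faults at once. The sketch below assumes that the FDA feature vectors are stacked as rows of an array and that the norm is the Euclidean norm; both the function name `pairwise_d2` and the choice of norm are assumptions of this illustration.

```python
import numpy as np

def pairwise_d2(features):
    """features: (n_faults, d) array whose ith row is the FDA feature vector FDA_i.

    Returns an (n_faults, n_faults) matrix D with D[i, j] = ||FDA_i - FDA_j||.
    """
    F = np.asarray(features, dtype=float)
    diff = F[:, None, :] - F[None, :, :]    # all pairwise differences
    return np.linalg.norm(diff, axis=2)
```

Rows of the resulting matrix with uniformly large entries would correspond to faults that are well separated from all others, while small off-diagonal entries point to faults that are hard to tell apart.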

The simulation results are shown in Fig. 4.6. The 22 kinds of data (the normal
operation and the 21 faulty operations) can be roughly divided into two major
categories. The first category consists of the faults that are significantly different
from the other faults: IDV(2) (line with ©), IDV(6) (line with *), and IDV(18)
(line with o). The other category is the set of faults whose characteristics are
relatively close to each other.

The faults IDV(1), IDV(2), IDV(6), and IDV(20) are further analyzed. The FDA
results for fault classification are shown in Fig. 4.7. The $D_2$ indices for these faults
vary considerably, as the classification results clearly illustrate. Conversely, certain
faults differ very little in their $D_2$ indices. For example, faults IDV(4), IDV(11),
and IDV(14) have similar FDA $D_2$ indices, as shown in Fig. 4.8. These faults are
difficult to classify accurately with the FDA model, as shown in Fig. 4.9.
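
The distance index above can be illustrated with a short NumPy sketch. It only shows how pairwise distances between FDA feature vectors could be computed; the function name `pairwise_d2` and the feature values are hypothetical and are not taken from the Tennessee Eastman experiments.

```python
import numpy as np

def pairwise_d2(features):
    """Euclidean distance between every pair of FDA feature vectors.

    features: array of shape (n_faults, n_features); row i is assumed to be
    the FDA feature vector of the i-th fault.
    """
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=2)

# Hypothetical 2-D FDA feature vectors for three operating conditions.
fda_features = np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0]])
d2 = pairwise_d2(fda_features)
```

In this toy case the first and third conditions are almost indistinguishable (small distance), while the second is clearly separated from both, which mirrors the two categories of faults discussed above.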
Fig. 4.4 PCA, CCA, and PLS monitoring results for IDV(16)

Fig. 4.5 PCA, CCA, and PLS monitoring results for IDV(20)

Fig. 4.6 $D_2$ index for different faults

Fig. 4.7 FDA identification result for faults 1, 2, 6, and 20


Fig. 4.8 $D_2$ indices for faults 4, 11, and 14

Fig. 4.9 FDA identification result for faults 4, 11, and 14
4.5 Conclusions


Two kinds of simulation platforms are introduced for verifying statistical monitoring
methods, and several experiments based on the traditional methods (PCA, PLS,
CCA, and FDA) are carried out. These basic experiments illustrate the characteristics
of the methods and their fault detection performance. Many improved methods have
been developed to overcome the shortcomings of the original multivariate statistical
analysis methods. Each method has its own conditions and scope of application, and
no single method completely outperforms the others. Furthermore, data-based fault
detection methods need to be combined with the actual monitored objects, and
existing methods need to be improved according to the knowledge and characteristics
of those objects. This book therefore focuses on fault detection (discrimination)
strategies for batch processes and strongly nonlinear systems.
Chapter 5
Soft-Transition Sub-PCA Monitoring of Batch Processes


Batch or semi-batch processes are used to produce high-value-added products in the
biological, food, and semiconductor industries. Batch processes, such as fermentation,
polymerization, and pharmaceutical production, are highly sensitive to abnormal
changes in operating conditions. Monitoring such processes is extremely important
for achieving higher productivity. However, it is more difficult to develop an exact
monitoring model for batch processes than for continuous processes because of their
common characteristics: non-steady, time-varying, finite-duration, and nonlinear
behavior. Because most batch processes lack an exact monitoring model, operators
cannot identify faults when they occur. Effective techniques for accurately monitoring
batch processes are therefore needed to prompt the operator to take corrective action
before the situation becomes more dangerous.

Generally, many batch processes are carried out in a sequence of steps and are
called multi-stage or multi-phase batch processes. Different phases have different
inherent natures, so it is desirable to develop stage-based models in which each model
represents a specific stage and focuses on the local behavior of the batch process.
This chapter focuses on a monitoring method based on multi-phase models. An
improved online sub-PCA method for multi-phase batch processes is proposed. A
two-step stage-dividing algorithm based on the support vector data description (SVDD)
technique is given to divide the multi-phase batch process into several operation
stages that reflect their inherent process correlation nature. Mechanism knowledge is
considered first by introducing the sampling time into the loading matrices of the PCA
model, which avoids segmentation mistakes caused by fault data. The SVDD method
is then used to refine the initial division and obtain the soft-transition sub-stage
between the stable and transition periods. The idea of soft transition helps to further
improve the division accuracy. A representative model is then built for each sub-stage,
and an online fault monitoring algorithm is given based on the division techniques
above. Compared with the conventional sub-PCA method, this method can detect
faults earlier and avoid false alarms because of its more precise stage division.

Fig. 5.1 Batch-wise unfolding


5.1 What Is Phase-Based Sub-PCA 


The general monitoring approach for batch processes is the phase/stage-based sub-PCA
method, which divides the process into several phases (Yao and Gao 2009). Phase-based
sub-PCA consists of three steps: data matrix unfolding, phase division, and sub-PCA
modeling. The details of each step are introduced below.


1. Data Matrix Unfolding 

Unlike a continuous process, the historical data of a batch process form a
three-dimensional array $X(I \times J \times K)$, where $I$ is the number of
batches, $J$ is the number of variables, and $K$ is the number of sampling times. The
original data $X$ must be rearranged into two-dimensional matrices before statistical
models are developed. Two traditional methods are widely applied, batch-wise
unfolding and variable-wise unfolding, of which batch-wise unfolding is the most
commonly used. After batch-wise unfolding, the three-dimensional array $X$ is cut
into $K$ time-slice matrices.

The three-dimensional process data $X(I \times J \times K)$ are batch-wise unfolded into
two-dimensional forms $X_k(I \times J)$, $k = 1, 2, \ldots, K$. The time-slice matrices
are then placed beneath one another rather than side by side, as shown in Fig. 5.1
(Westerhuis et al. 1999; Wold et al. 1998). Sometimes batches have different lengths,
i.e., the number of samples $K$ differs between batches, and the process data need to
be aligned before unfolding. Many data alignment methods have been proposed, such
as directly filling zeros at missing sampling times (Arteaga and Ferrer 2002) and
dynamic time warping (Kassidas et al. 1998). These unfolding approaches do not
require any estimation of unknown future data for online monitoring.
2. Phase Division

The traditional multivariate statistical analysis methods are valid for continuous
processes, since all variables are supposed to stay around a certain stable state and
the correlations between these variables remain relatively stable. Non-steady-state
operating conditions, such as time-varying and multi-phase behavior, are typical
characteristics of a batch process. The process correlation structure might change
due to process dynamics and time-varying factors. A statistical model may be
ill-suited if it treats the entire batch data as a single object, because the process
correlations in different stages are not captured effectively. Multi-phase statistical
analysis therefore aims to employ a separate model for each period instead of a
single model for the entire process. Phase division thus plays a key role in batch
process monitoring.

Many studies divide the process into multiple phases based on mechanism knowledge.
For example, the division may be based on different processing units or on
distinguishable operational phases within each unit (Dong and McAvoy 1996;
Reinikainen and Hoskuldsson 2007). It is suggested that process data can be
naturally divided into groups prior to modelling and analysis. This stage division
directly reflects the operational state of the process. However, the available prior
knowledge is usually not sufficient to divide processes into phases reasonably.
Besides, Muthuswamy and Srinivasan identified several division points according
to process variable features described in the form of multivariate rules
(Muthuswamy and Srinivasan 2003). Undey and Cinar used an indicator variable
that contained significant landmarks to detect the completion of each phase
(Undey and Cinar 2002). Doan and Srinivasan divided the phases based on
the singular points in some known key variables (Doan and Srinivasan 2008).
Kosanovich, Dahl, and Piovoso pointed out that changes in the process variance
explained by the principal components could indicate the division points between
process stages (Kosanovich and Dahl 1996). There are many results in this area,
but they do not give a clear strategy for distinguishing the steady phases from the
transition phases (Camacho and Pico 2006; Camacho et al. 2008; Yao and Gao
2009).


3. Sub-PCA Modeling 

The statistical models are constructed for all the phases after the phase division and
are not limited to PCA methods. Here, sub-PCA is taken as a representative of these
sub-statistical monitoring methods. The final sub-PCA model of each phase is
calculated by taking the average of the time-slice PCA models in the corresponding
phase. The number of principal components of each phase is determined based on
the relative cumulative variance.

The $T^2$ and SPE statistics and their corresponding control limits are calculated
according to the sub-PCA model. For new data, the Euclidean distance to the cluster
center of each stage is checked to determine the stage in which the new data are
located. Then, the corresponding sub-PCA model is used to monitor the new data.
A fault alarm is raised according to the control limits of $T^2$ or SPE.
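
The batch-wise unfolding described in step 1 can be sketched with NumPy. This is a minimal illustration under the assumption that the aligned batch data are stored in an array of shape (I, J, K); the function names and the toy sizes are hypothetical.

```python
import numpy as np

def time_slices(X):
    """Split a batch array X of shape (I, J, K) into K slices of shape (I, J)."""
    return [X[:, :, k] for k in range(X.shape[2])]

def stack_slices(X):
    """Place the K time-slice matrices beneath one another: shape (K * I, J)."""
    return np.concatenate(time_slices(X), axis=0)

# Toy sizes: I batches, J variables, K sampling times.
I, J, K = 4, 3, 5
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)

slices = time_slices(X)      # K matrices X_k, one per sampling time
stacked = stack_slices(X)    # slices arranged beneath one another
```

Each element of `slices` plays the role of one time-slice matrix $X_k$; rows `k * I` to `(k + 1) * I - 1` of `stacked` hold the same data.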
5.2 SVDD-Based Soft-Transition Sub-PCA


An industrial batch process operates in a variety of states, including grade changes,
startup, shutdown, and maintenance operations. A transitional region between
neighboring stages is very common in a multistage process and shows the gradual
changeover from one operation pattern to another. Usually, a transitional phase
first shows basic characteristics that are more similar to the previous stable phase
and then, toward the end of the transition, more similar to the next stable phase.
Different transition phases follow different trajectories from one stable mode to
another, with changes in characteristics that are more pronounced over sampling time
and more complex than those within a phase. Therefore, valid process monitoring
during transitions is very important. Up to now, few investigations of transition
modeling and monitoring have been reported (Zhao et al. 2007). Here, a new transition
identification and monitoring method based on the SVDD division method is proposed.


5.2.1 Rough Stage-Division Based on Extended Loading Matrix


The original three-dimensional array $X(I \times J \times K)$ is first batch-wise unfolded
into the two-dimensional form $X_k$. By subtracting the grand mean of each variable over
all time and all batches, the unfolded matrix $X_k$ is centered and scaled:

\[
\tilde{X}_k = \frac{X_k - \mathrm{mean}(X_k)}{\sigma(X_k)}, \tag{5.1}
\]

where $\mathrm{mean}(X_k)$ and $\sigma(X_k)$ represent the mean value and the standard
deviation of matrix $X_k$, respectively. The main nonlinear and dynamic components of
every variable are still left in the scaled matrix. In the following, $X_k$ denotes this
scaled matrix.

Suppose the unfolding matrix at each time slice is $X_k$. Project it onto the principal
component subspace with the loading matrix $P_k$ to obtain the score matrix $T_k$:

\[
X_k = T_k P_k^T + E_k, \tag{5.2}
\]

where $E_k$ is the residual. The first few principal components, which represent the
major variation of the original data set $X_k$, are retained. The original data set $X_k$
is thus divided into the PCA model prediction $\hat{X}_k = T_k P_k^T$ and the residual
matrix $E_k$. Techniques such as cross-validation can be used to determine the most
appropriate number of retained principal components. The loading matrix $P_k$ and the
singular value matrix $S_k$ of each time-slice matrix $X_k$ can then be obtained.
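
The scaling in (5.1) and the decomposition in (5.2) can be sketched for a single time slice as follows. This is an illustrative NumPy sketch, not the implementation used for the case study: the function names are hypothetical, each variable (column) is scaled with its own mean and sample standard deviation, and the number of retained components is simply passed in rather than chosen by cross-validation.

```python
import numpy as np

def scale_slice(Xk):
    """Center and scale one time-slice matrix (I x J) column by column, cf. (5.1)."""
    mean = Xk.mean(axis=0)
    std = Xk.std(axis=0, ddof=1)
    return (Xk - mean) / std, mean, std

def slice_pca(Xk_scaled, n_components):
    """PCA of a scaled time slice via SVD, so that X = T P^T + E, cf. (5.2)."""
    U, s, Vt = np.linalg.svd(Xk_scaled, full_matrices=False)
    P = Vt[:n_components].T                       # loading matrix, J x a
    T = Xk_scaled @ P                             # score matrix, I x a
    E = Xk_scaled - T @ P.T                       # residual matrix, I x J
    eigvals = s**2 / (Xk_scaled.shape[0] - 1)     # variance along each component
    return P, T, E, eigvals

# Hypothetical time slice: 10 batches, 4 correlated variables.
rng = np.random.default_rng(0)
Xk = rng.normal(size=(10, 4)) @ rng.normal(size=(4, 4))
Xs, mean, std = scale_slice(Xk)
P, T, E, eigvals = slice_pca(Xs, n_components=2)
```

The stored `mean` and `std` are what an online monitor would reuse to scale a new sample taken at the same sampling time.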
As the loading matrix $P_k$ reflects the correlations of the process variables, it is
usually used to identify the process stage. Sometimes disturbances caused by
measurement noise or other factors lead to a wrong division, because a loading matrix
obtained from process data alone can hardly distinguish erroneous data from
transition-phase data. Generally, the different phases in a batch process can first be
distinguished according to mechanism knowledge.

The sampling time is added to the loading matrix in order to divide the process
exactly. The sampling time is a continuously increasing data set, so it must also be
centered and scaled before being added to the loading matrix. The sampling time is
centered and scaled not along the batch dimension, like the process data $X$, but along
the time dimension within one batch. The scaled time $t_k$ is then expanded into a
vector $\mathbf{t}_k$ by multiplying it by a unit column vector. The new time-slice
matrix is written as $\hat{P}_k = [P_k, \mathbf{t}_k]$, in which $\mathbf{t}_k$ is a
$J \times 1$ column vector with the repeated value of the current sampling time. The
sampling time does not change much as the batch process proceeds, but it has an
obvious effect on the phase separation. Define the Euclidean distance between the
extended loading matrices $\hat{P}_i$ and $\hat{P}_j$ as

\[
\begin{aligned}
\| \hat{P}_i - \hat{P}_j \|^2
  &= [P_i - P_j,\ \mathbf{t}_i - \mathbf{t}_j]\,[P_i - P_j,\ \mathbf{t}_i - \mathbf{t}_j]^T \\
  &= \| P_i - P_j \|^2 + \| \mathbf{t}_i - \mathbf{t}_j \|^2 .
\end{aligned} \tag{5.3}
\]

Then the batch process can be divided into $S_1$ stages by using the K-means clustering
method to cluster the extended loading matrices $\hat{P}_k$.

Clearly, the Euclidean distance of the extended loading matrix $\hat{P}_i$ includes both
data differences and sampling time differences. The data at different stages differ
significantly in sampling time. Therefore, when noise interference makes the data 
at different stages present the same or similar characteristics, the large differences 
in sampling times will keep the final Euclidean distance at a large value. This is 
because the erroneous division data is very different in sampling time from the data 
from the other stages, while the data from the transition stage has very little variation 
in sampling time. We can easily distinguish erroneous divisions in the transition 
phase from those caused by noise. 
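
The rough division can be sketched as follows. The code builds the extended matrices $[P_k, \mathbf{t}_k]$, and clusters them with a plain Lloyd-type K-means. It is only a sketch under stated assumptions: the function names are hypothetical, the toy loading matrices are artificial, and the cluster centers are initialised at evenly spaced sampling times to keep the example deterministic, which is a simplification rather than part of the method described above.

```python
import numpy as np

def extend_loadings(loadings, times):
    """Append the scaled sampling time as an extra column to each loading matrix.

    loadings: array (K, J, a) of time-slice loading matrices P_k.
    times:    array (K,) of sampling times, scaled along the time dimension.
    Returns an array (K, J, a + 1) holding the extended matrices [P_k, t_k].
    """
    t = (times - times.mean()) / times.std()
    K, J, _ = loadings.shape
    t_cols = np.broadcast_to(t[:, None, None], (K, J, 1))
    return np.concatenate([loadings, t_cols], axis=2)

def kmeans(points, n_clusters, n_iter=50):
    """Plain Lloyd iteration on row vectors (stand-in for any K-means routine)."""
    # Assumed initialisation: centers at evenly spaced sampling times.
    idx = np.linspace(0, len(points) - 1, n_clusters).astype(int)
    centers = points[idx].copy()
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

# Toy data: 40 sampling times, 3 variables, one component; the loading
# direction switches halfway through the batch.
K, J = 40, 3
P_a = np.array([[1.0], [0.0], [0.0]])
P_b = np.array([[0.0], [1.0], [0.0]])
loadings = np.stack([P_a] * 20 + [P_b] * 20)
times = np.arange(K, dtype=float)

ext = extend_loadings(loadings, times)
labels, centers = kmeans(ext.reshape(K, -1), n_clusters=2)
```

Because the time column is repeated in all $J$ rows, the squared distance between two extended matrices equals the loading difference plus $J$ times the squared difference of the scaled times, which is the decomposition in (5.3).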


5.2.2 Detailed Stage-Division Based on SVDD 


The extended time-slice loading matrices $\hat{P}_k$ represent the local covariance
information and the underlying process behavior, as mentioned before, so they are used
to determine the operation stages through proper analysis and clustering procedures.
The process is divided into different stages, and each separated process stage contains
a series of successive samples. Moreover, a transition stage should not be forcibly
incorporated into a steady stage because its process characteristics vary in a complex
way. The changing process characteristics during a transition adversely affect the
accuracy of stage-based sub-PCA monitoring models. Furthermore, fault detection
performance deteriorates if only a single fixed sub-PCA model is employed to monitor
the transition stage. Consequently, a new method based on SVDD is proposed to
separate the transition regions after the rough stage division determined by K-means
clustering.

SVDD is a relatively new data description method, originally proposed by Tax and
Duin for the one-class classification problem (Tax and Duin 1999, 2004). SVDD has
been employed for damage detection, image classification, one-class pattern
recognition, etc. Recently, it has also been applied to the monitoring of continuous
processes. However, SVDD has not been used for batch process phase separation and
recognition up to now.

The loading matrices of each stage are used to train the SVDD model of the transition
process. The SVDD model first maps the data from the original space to a feature
space through a nonlinear transformation, which is expressed by a kernel function.
A hypersphere with minimum volume is then sought in the feature space. To construct
such a minimum-volume hypersphere, the following optimization problem is solved:

\[
\begin{aligned}
\min\ & \varepsilon(R, A, \xi) = R^2 + C \sum_i \xi_i \\
\text{s.t.}\ & \| \hat{P}_i - A \|^2 \le R^2 + \xi_i,\quad \xi_i \ge 0,\ \forall i,
\end{aligned} \tag{5.4}
\]


where $R$ and $A$ are the radius and center of the hypersphere, respectively, and $C$
gives the trade-off between the volume of the hypersphere and the number of wrongly
divided samples. $\xi_i$ is a slack variable that allows some of the training samples to
be wrongly classified. The dual form of the optimization problem (5.4) can be written as

\[
\begin{aligned}
\max\ & \sum_i \alpha_i K(\hat{P}_i, \hat{P}_i) - \sum_{i,j} \alpha_i \alpha_j K(\hat{P}_i, \hat{P}_j) \\
\text{s.t.}\ & 0 \le \alpha_i \le C,
\end{aligned} \tag{5.5}
\]


where $K(\hat{P}_i, \hat{P}_j)$ is the kernel function and $\alpha_i$ is the Lagrange
multiplier. Here, the Gaussian kernel is selected as the kernel function. A general
quadratic programming method is used to solve the optimization problem (5.5). The
hypersphere radius $R$ can be calculated from the optimal solution $\alpha_i$:

\[
R^2 = 1 - 2 \sum_{i} \alpha_i K(\hat{P}_k, \hat{P}_i) + \sum_{i,j} \alpha_i \alpha_j K(\hat{P}_i, \hat{P}_j). \tag{5.6}
\]


Here, the loading matrices $\hat{P}_k$ are those corresponding to nonzero parameters
$\alpha_k$, which means that they affect the SVDD model. The transition phase can then
be distinguished from the steady phase by inputting all the time-slice matrices
$\hat{P}_k$ into the SVDD model. When new data $\hat{P}_{new}$ are available, the
hyperspace distance from the new data to the hypersphere center is calculated first:

\[
D^2 = \| \hat{P}_{new} - A \|^2
    = 1 - 2 \sum_{i} \alpha_i K(\hat{P}_{new}, \hat{P}_i) + \sum_{i,j} \alpha_i \alpha_j K(\hat{P}_i, \hat{P}_j). \tag{5.7}
\]


If the hyperspace distance is not greater than the hypersphere radius, i.e.,
$D^2 \le R^2$, the process data $\hat{P}_{new}$ belong to a steady stage; otherwise
(that is, $D^2 > R^2$), the data are assigned to a transition stage. The detailed
division splits the whole batch into $S_2$ stages, comprising $S_1$ steady stages and
$S_2 - S_1$ transition stages.
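
The SVDD steps in (5.5)-(5.7) can be sketched with NumPy alone. Several assumptions are made and should be read as such: the general quadratic programming solver is replaced here by a simple Frank-Wolfe iteration; $C = 1$ is assumed so that the upper bound on the multipliers is never active; the multipliers are kept summing to one, as in the standard SVDD dual of Tax and Duin; the kernel width is arbitrary; and $R^2$ is taken as the largest training distance, which coincides with the support-vector distance in (5.6) at the exact optimum for this setting. All function names and data are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, width=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width**2))

def fit_svdd(X, width=1.0, n_iter=2000):
    """Approximate SVDD dual solution with C = 1 (assumed) by Frank-Wolfe.

    With a Gaussian kernel K(x, x) = 1, so maximising (5.5) amounts to
    minimising alpha^T K alpha over multipliers that are nonnegative and sum to one.
    """
    n = len(X)
    K = gaussian_kernel(X, X, width)
    alpha = np.full(n, 1.0 / n)
    for t in range(n_iter):
        grad = 2.0 * K @ alpha
        i = int(np.argmin(grad))          # best vertex of the simplex
        step = 2.0 / (t + 2.0)
        alpha *= 1.0 - step
        alpha[i] += step
    const = float(alpha @ K @ alpha)      # last term of (5.6) and (5.7)
    d2_train = 1.0 - 2.0 * (K @ alpha) + const
    return alpha, const, float(d2_train.max())

def svdd_distance2(X_new, X_train, alpha, const, width=1.0):
    """Squared hyperspace distance D^2 of new rows to the sphere center, cf. (5.7)."""
    k = gaussian_kernel(X_new, X_train, width)
    return 1.0 - 2.0 * (k @ alpha) + const

# Hypothetical vectorised extended loadings of one steady stage.
rng = np.random.default_rng(0)
train = rng.normal(scale=0.5, size=(30, 2))
alpha, const, r2 = fit_svdd(train)

far = np.array([[6.0, 6.0]])              # a point unlike the training data
d2_far = svdd_distance2(far, train, alpha, const)[0]
is_transition = d2_far > r2               # decision rule D^2 > R^2
```

A point that resembles the training data yields $D^2 \le R^2$ and is treated as steady, while a dissimilar point such as `far` falls outside the hypersphere.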

The mean loading matrix $\bar{P}_s$ can be adopted to obtain the sub-PCA model of the
$s$th stage because the time-slice loading matrices within one stage are similar.
$\bar{P}_s$ is the mean of the loading matrices $P_k$ in the $s$th stage. The number of
principal components $a_s$ is obtained by accumulating the relative variance of the
principal components until the cumulative value reaches 85%. The mean loading matrix
is then modified according to the number of retained principal components. The
sub-PCA model can be described as

\[
\begin{aligned}
T_k &= X_k \bar{P}_s, \\
\hat{X}_k &= T_k \bar{P}_s^T, \\
E_k &= X_k - \hat{X}_k.
\end{aligned} \tag{5.8}
\]
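
A sub-PCA model of one stage, as in (5.8), could be assembled along the following lines. This is a hedged sketch: the function names are hypothetical, the variances are averaged over the stage before the 85% rule is applied (an assumption, since the text only states that the loadings are averaged), and sign alignment of the loading vectors across time slices is ignored.

```python
import numpy as np

def n_components_by_variance(eigvals, threshold=0.85):
    """Smallest number of components whose cumulative variance ratio reaches the threshold."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratio, threshold) + 1)

def sub_pca_model(loadings, eigvals, threshold=0.85):
    """Mean loading matrix of one stage, truncated to the retained components.

    loadings: (n_slices, J, J) full time-slice loading matrices of the stage.
    eigvals:  (n_slices, J) variances of the components, largest first.
    """
    P_mean = loadings.mean(axis=0)
    S_mean = eigvals.mean(axis=0)
    a = n_components_by_variance(S_mean, threshold)
    return P_mean[:, :a], S_mean[:a]

def apply_sub_pca(Xk, P_s):
    """Scores, prediction, and residual of a scaled time slice, cf. (5.8)."""
    T = Xk @ P_s
    X_hat = T @ P_s.T
    E = Xk - X_hat
    return T, X_hat, E

# Hypothetical stage with two identical time-slice models and four variables.
J = 4
loadings = np.stack([np.eye(J), np.eye(J)])
eigvals = np.array([[5.0, 3.0, 1.0, 1.0], [5.0, 3.0, 1.0, 1.0]])
P_s, S_s = sub_pca_model(loadings, eigvals)

rng = np.random.default_rng(1)
Xk = rng.normal(size=(6, J))
T, X_hat, E = apply_sub_pca(Xk, P_s)
```

With the toy variances 5, 3, 1, 1 the cumulative ratios are 0.5, 0.8, 0.9, 1.0, so three components are retained.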


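The $T^2$ and SPE statistic control limits are calculated as

\[
\begin{aligned}
T^2_{\alpha,s,i} &\sim \frac{a_{s,i}(I-1)}{I - a_{s,i}}\, F_{a_{s,i},\, I - a_{s,i},\, \alpha}, \\
\mathrm{SPE}_{k,\alpha} &= g_k \chi^2_{h_k,\alpha}, \quad
g_k = \frac{v_k}{2 m_k}, \quad h_k = \frac{2 m_k^2}{v_k},
\end{aligned} \tag{5.9}
\]

where $m_k$ and $v_k$ are the mean and variance of the SPE of all batches at time $k$,
respectively, $a_{s,i}$ is the number of retained principal components for batch $i$
($i = 1, 2, \ldots, I$) in stage $s$, $I$ is the number of batches, and $\alpha$ is the
significance level.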
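
The two SPE parameters in (5.9) follow from matching the first two moments of a scaled chi-square variable: $g\chi^2_h$ has mean $gh$ and variance $2g^2h$. The sketch below computes $g_k$ and $h_k$ from hypothetical SPE values of several normal batches at one sampling time; the function name and the numbers are illustrative, and the chi-square quantile needed for the final limit would come from a statistics library that is not called here.

```python
import numpy as np

def spe_limit_parameters(spe_values):
    """Moment-matched parameters g and h of the SPE limit g * chi2_h, cf. (5.9)."""
    m = np.mean(spe_values)
    v = np.var(spe_values, ddof=1)
    g = v / (2.0 * m)
    h = 2.0 * m**2 / v
    return g, h

# Hypothetical SPE values of six normal batches at sampling time k.
spe_at_k = np.array([1.2, 0.8, 1.5, 1.0, 0.9, 1.1])
g_k, h_k = spe_limit_parameters(spe_at_k)
```

Multiplying the chi-square quantile with $h_k$ degrees of freedom at the chosen significance level by $g_k$ then gives the SPE control limit at time $k$.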


5.2.3 PCA Modeling for Transition Stage 


A soft-transition multi-phase PCA modeling method based on SVDD is now presented,
building on the stage division described above. It uses the SVDD hypersphere radius to
determine the range of the transition region between two different stages. It also
introduces the concept of membership grades to quantitatively evaluate the similarity
between the data at the current sampling time and the transition (or steady) stage
models. Sub-PCA models are established separately for steady phases and transition
phases, which greatly improves the accuracy of the models. Moreover, they reflect how
the characteristics change between neighboring stages. Time-varying monitoring
models in transition regions are established using the membership grades, as weighted
sums of the nearby steady-phase and transition-phase sub-models. Membership grade
values are used to describe the partition problem with an ambiguous boundary, and they
can objectively reflect how the process correlations change from one stage to another.

Here, the hyperspace distance $D_{k,s}$ is defined from the sampling data at time
$k$ to the center of the $s$th SVDD sub-model. It is used as a dissimilarity index to
quantitatively evaluate the changing trend of the process characteristics. The
correlation coefficients $\lambda_{l,k}$ are used as the weights of the soft-transition
sub-models and are defined as

\[
\begin{aligned}
\lambda_{s-1,k} &= \frac{D_{k,s} + D_{k,s+1}}{2\,(D_{k,s-1} + D_{k,s} + D_{k,s+1})}, \\
\lambda_{s,k}   &= \frac{D_{k,s-1} + D_{k,s+1}}{2\,(D_{k,s-1} + D_{k,s} + D_{k,s+1})}, \\
\lambda_{s+1,k} &= \frac{D_{k,s-1} + D_{k,s}}{2\,(D_{k,s-1} + D_{k,s} + D_{k,s+1})},
\end{aligned} \tag{5.10}
\]


where $l = s-1$, $s$, and $s+1$ is the stage number, representing the previous steady
stage, the current transition stage, and the next steady stage, respectively. The
correlation coefficient decreases as the hyperspace distance increases: the greater
the distance to a sub-model, the smaller the weight of that sub-model. The monitoring
model for the transition phase at each time interval is obtained from the weighted sum
of the sub-PCA models, i.e.,

\[
P'_k = \sum_{l=s-1}^{s+1} \lambda_{l,k} \bar{P}_l. \tag{5.11}
\]
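
The weights in (5.10) and the weighted loading matrix in (5.11) can be checked numerically with a small sketch. The distances and loading matrices below are hypothetical, and the function names are illustrative only.

```python
import numpy as np

def transition_weights(d_prev, d_cur, d_next):
    """Weights of the previous, current, and next sub-models, cf. (5.10)."""
    total = d_prev + d_cur + d_next
    return np.array([d_cur + d_next, d_prev + d_next, d_prev + d_cur]) / (2.0 * total)

def soft_transition_loading(weights, P_prev, P_cur, P_next):
    """Weighted sum of the three sub-PCA loading matrices, cf. (5.11)."""
    return weights[0] * P_prev + weights[1] * P_cur + weights[2] * P_next

# Hypothetical hyperspace distances to the three neighboring sub-models.
lam = transition_weights(1.0, 2.0, 5.0)

# Hypothetical mean loading matrices (3 variables, 2 components).
P_prev = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
P_cur = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
P_next = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
P_soft = soft_transition_loading(lam, P_prev, P_cur, P_next)
```

The three weights always sum to one, and the sub-model with the smallest distance receives the largest weight. Note that a weighted sum of orthonormal loading matrices is in general no longer exactly orthonormal.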


The soft-transition PCA model in (5.11) properly reflects the time-varying development
of the transition. The score matrix $T'_k$ and the covariance matrix $S'_k$ can be
obtained at each time instant. The SPE statistic control limit is still calculated by
(5.9). Different batches differ somewhat in their transition stages, so the average
$T^2$ limits over all batches are used to monitor the process in order to improve the
robustness of the proposed method. The $T^2$ statistical control limits can be
calculated from the historical batch data and the correlation coefficients:


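\[
T^{\prime 2}_{\alpha,k} = \frac{1}{I} \sum_{l=s-1}^{s+1} \sum_{i=1}^{I} \lambda_{l,k}\, T^2_{\alpha,l,i}, \tag{5.12}
\]

where $i$ ($i = 1, 2, \ldots, I$) is the batch number and $T^2_{\alpha,l,i}$ is the
sub-stage $T^2$ statistic control limit of each batch, which is calculated by (5.9) for
the corresponding sub-stage.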

Now the soft-transition model of each time interval in transition stages is obtained. 
The batch process can be monitored efficiently by combining with the steady stage 
model given in Sect. 5.2.2. 
5.2.4 Monitoring Procedure of Soft-Transition Sub-PCA


The whole batch process is divided into several steady stages and transition stages
by the two-step stage division described in Sects. 5.2.1 and 5.2.2. The new
soft-transition sub-PCA method is then applied to obtain the detailed sub-models, as
described in Sect. 5.2.3. The modeling steps are as follows:


(1) Get the normal process data of $I$ batches, unfold them into two-dimensional
time-slice matrices, and then center and scale each time-slice data as in (5.1).

(2) Perform PCA on the normalized matrix of each time slice and get the loading
matrices $P_k$, which represent the process correlation at each time interval. Add
the sampling time $t$ to the loading matrices to get the extended matrices $\hat{P}_k$.

(3) Divide the process roughly into $S_1$ stages using k-means clustering on the
extended loading matrices $\hat{P}_k$. Train the SVDD classifier for the original
$S_1$ steady process stages.

(4) Input the extended loading matrices $\hat{P}_k$ into the original SVDD model again
to divide the process explicitly into $S_2$ stages: the steady stages and the
transition stages. Then retrain the SVDD classifier for these new $S_2$ stages.
Calculate the mean loading matrix $\bar{P}_s$ of each new steady stage and build the
sub-PCA model as in (5.8). Calculate the correlation coefficients $\lambda_{l,k}$ to
get the soft-transition stage model in (5.11) for each transition stage.

(5) Calculate the control limits of SPE and $T^2$ to monitor new process data.


The whole flowchart of the improved sub-PCA modeling based on SVDD soft transition
is shown in Fig. 5.2. The modeling process is offline and depends on the historical
data of $I$ batches.

The following steps should be adopted during online process monitoring. 


(1) Get a new sampling time-slice data $x_{new}$, and center and scale it based on the
mean and standard deviation of the prior normal $I$ batches data.

(2) Calculate the covariance matrix of the new data; the loading matrix $P_{new}$ can
then be obtained by singular value decomposition. Add the sampling time $t_{new}$
to it to obtain the extended matrix $\hat{P}_{new}$. Input the new matrix
$\hat{P}_{new}$ into the SVDD model to identify which stage the new data belong to.

(3) If the current time-slice data belong to a transition stage, the weighted-sum
loading matrix $P'_{new}$ is employed to calculate the score vector $t_{new}$ and
error vector $e_{new}$,

\[
\begin{aligned}
t_{new} &= x_{new} P'_{new}, \\
e_{new} &= x_{new} - \hat{x}_{new} = x_{new} \left( \mathbf{I} - P'_{new} P^{\prime T}_{new} \right).
\end{aligned} \tag{5.13}
\]

Or, if they belong to a steady stage, the mean loading matrix $\bar{P}_s$ is used to
calculate the score vector $t_{new}$ and error vector $e_{new}$,
Fig. 5.2 Illustration of soft-transition sub-PCA modeling


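\[
\begin{aligned}
t_{new} &= x_{new} \bar{P}_s, \\
e_{new} &= x_{new} - \hat{x}_{new} = x_{new} \left( \mathbf{I} - \bar{P}_s \bar{P}_s^T \right).
\end{aligned} \tag{5.14}
\]

(4) Calculate the SPE and $T^2$ statistics of the current data as follows:

\[
\begin{aligned}
T^2_{new} &= t_{new} \bar{S}_s^{-1} t_{new}^T, \\
\mathrm{SPE}_{new} &= e_{new} e_{new}^T,
\end{aligned} \tag{5.15}
\]

where $\bar{S}_s$ denotes the score covariance matrix of the model in use.

(5) Judge whether the SPE and $T^2$ statistics of the current data exceed the control
limits. If either of them exceeds its control limit, raise an abnormal alarm; if
neither does, the current data are normal.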
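
Online steps (1), (3), (4), and (5) can be sketched for one new sample as follows. This is an illustrative sketch only: the function names are hypothetical, the score covariance matrix is assumed to be diagonal so that only its diagonal is passed in, and the control limits are placeholder numbers rather than values computed from (5.9).

```python
import numpy as np

def monitor_sample(x_new, mean, std, P, score_var):
    """T^2 and SPE of one new sample for a given loading matrix, cf. (5.13)-(5.15).

    P is either the mean loading matrix of a steady stage or the weighted-sum
    loading matrix of a transition stage; score_var holds the variances of the
    retained scores (diagonal of the score covariance matrix, assumed diagonal).
    """
    x = (x_new - mean) / std          # step (1): scale with stored statistics
    t = x @ P                         # score vector
    e = x - t @ P.T                   # error vector
    t2 = float(np.sum(t**2 / score_var))
    spe = float(e @ e)
    return t2, spe

def is_alarm(t2, spe, t2_limit, spe_limit):
    """Step (5): alarm if either statistic exceeds its control limit."""
    return bool(t2 > t2_limit or spe > spe_limit)

# Hypothetical model: 3 variables, 2 retained components.
mean, std = np.zeros(3), np.ones(3)
P = np.eye(3)[:, :2]
score_var = np.array([4.0, 1.0])
t2_limit, spe_limit = 9.0, 4.0        # placeholder control limits

t2_ok, spe_ok = monitor_sample(np.array([2.0, 1.0, 0.0]), mean, std, P, score_var)
t2_bad, spe_bad = monitor_sample(np.array([0.0, 0.0, 3.0]), mean, std, P, score_var)
```

The first sample lies entirely in the model subspace, so its SPE is zero and no alarm is raised; the second sample is orthogonal to the model subspace and is flagged by the SPE statistic.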
5.3 Case Study


5.3.1 Stage Identification and Modeling


The fed-batch penicillin fermentation process is used as a simulation case in this
section. A detailed description of the process is available in Chap. 4. A reference
data set of 10 batches is simulated under nominal conditions with small perturbations.
The completion time is 400 h. All variables are sampled every 1 h, so one batch
provides 400 samples.

The rough division result based on the K-means method is shown in Fig. 5.3. Initially,
the batch process is classified into 3 steady stages, i.e., $S_1 = 3$. An SVDD classifier
with a Gaussian kernel function is then used for the detailed division. The hypersphere
radius of the original 3 stages is calculated, and the distances from each sampling data
to the hypersphere center are shown in Fig. 5.4.

As can be seen from Fig. 5.4, the sampling data between two stages, such as the data
during the time intervals 28-42 and 109-200, are clearly outside the hypersphere.
This means that the data in these two time regions differ significantly from those of
the steady stages. Therefore, these two regions are considered transition stages. The
process is further divided into 5 stages according to the detailed SVDD division, as
shown in Fig. 5.5.

Fig. 5.3 Rough division result based on K-means clustering

Fig. 5.4 SVDD stage classification result

Fig. 5.5 Detailed process division result based on SVDD


It is obvious that the time intervals 1-27, 43-109, and 202-400 are steady stages.
The hyperspace distances in the intervals 28-42 and 109-200 clearly exceed the
radius of the hypersphere, so these two intervals are separated as transition stages.
The SVDD classifier model is then rebuilt. The whole batch process data set is
divided into five stages using the phase identification method proposed in this
chapter, that is, $S_2 = 5$.
5.3.2 Monitoring of Normal Batch


Monitoring results of the improved sub-PCA method for a normal batch are presented
in Fig. 5.6. The blue line is the statistic corresponding to the online data and the
red line is the control limit with 99% confidence, which is calculated from the normal
historical data. Because of the large change in the hyperspace distance at about 30 h
in Fig. 5.4, the $T^2$ control limit drops sharply. The $T^2$ statistic of this batch
still stays below the control limit. Neither monitoring statistic ($T^2$ or SPE) yields
any false alarms, which means that this batch behaves normally throughout the run.


5.3.3 Monitoring of Fault Batch 


Monitoring results of the proposed method are compared with those of the traditional
sub-PCA method in order to illustrate its effectiveness. Two kinds of faults are used
to test the monitoring system. Fault 1 is a 10% step decrease in the agitator power
variable during the time interval 20-100. Figures 5.7 and 5.8 show that the SPE
statistic increases sharply beyond the control limit in both methods, while the $T^2$
statistic, which in fact reflects the change in the sub-PCA model, does not exceed the
control limit in the traditional sub-PCA method. This means that the proposed
soft-transition method builds a more exact model than the traditional sub-PCA method.
Fig. 5.6 Monitoring plots for a normal batch

Fig. 5.7 The proposed soft-transition monitoring for fault 1

Fig. 5.8 The traditional Sub-PCA monitoring for fault 1

Fig. 5.9 Projection in 3

Fig. 5.10 Projection in 3

The differences between these two methods can be seen directly in the projection
maps, i.e., Figs. 5.9 and 5.10. The blue dots are the projections of the data in the
time interval 50-100 onto the space of the first two principal components, and the red
line is the control limit. Figure 5.10 shows that none of the data fall outside the
control limit when the traditional sub-PCA method is used. The reason is that the
traditional sub-PCA does not separate out the transition stage. The proposed
soft-transition sub-PCA can effectively diagnose the abnormal or fault data, as shown
in Fig. 5.9.

Fault 2 is a ramp decrease with a slope of 0.1 added to the substrate feed rate during
the time interval 20-100. Online monitoring results of the proposed method and the
traditional sub-PCA method are shown in Figs. 5.11 and 5.12, respectively. It can be
seen that this fault is detected by both methods. The SPE statistic of the proposed
method exceeds the limit at about 50 h and the $T^2$ statistic alarms at 45 h. Both
then increase slightly and continuously until the end of the fault.

Fig. 5.11 Proposed Soft-transition monitoring results for fault 2

Fig. 5.12 The traditional Sub-PCA monitoring for fault 2


Figure 5.12 clearly shows that the SPE statistic of the traditional sub-PCA does not
alarm until about 75 h, which lags far behind that of the proposed method. Meanwhile,
its $T^2$ statistic raises a fault alarm at the beginning of the process. This is a
false alarm caused by the change in the initial state of the process. In comparison,
the proposed method has fewer false alarms, and its fault alarm time is clearly ahead
of that of the traditional sub-PCA.
Table 5.1 Monitoring results of FA for other faults

Fault | Var. | Fault | M/S  | Fault    | Soft-transition sub-PCA | Trad. sub-PCA (Camacho and Pico 2006)
ID    | No.  | type  | (%)  | time (h) | Time   | Time   | FA    | Time   | Time   | FA
      |      |       |      |          | (SPE)  | (T²)   |       | (SPE)  | (T²)   |
1     | 2    | Step  | -15  | 20       | 20     | 28     | 0     | 20     | none   | 9
2     | 2    | Step  | -15  | 100      | 100    | 100    | 0     | 100    | 101    | 1
3     | 3    | Step  | -10  | 190      | 190    | 199    | 0     | 190    | 213    | 11
4     | 3    | Step  | -10  | 30       | 48     | 45     | 0     | 81     | 45     | 5
5     | 1    | Step  | -10  | 20       | 20     | 20     | 0     | 20     | 48     | 1
6     | 1    | Step  | -10  | 150      | 150    | 151    | 0     | 150    | 151    | 2
7     | 3    | Ramp  | -5   | 20       | 28     | 40     | 0     | 28     | 41     | 1
8     | 2    | Ramp  | -20  | 20       | 31     | 45     | 0     | 44     | 34     | 6
9     | 1    | Ramp  | -10  | 20       | 24     | 30     | 0     | 21     | 28     | 10
10    | 3    | Ramp  | -0.2 | 170      | 171    | 171    | 0     | 170    | 173    | 3
11    | 2    | Ramp  | -20  | 170      | 181    | 195    | 0     | 177    | 236    | 1
12    | 1    | Ramp  | -10  | 180      | 184    | 188    | 0     | 185    | 185    | 2


The monitoring results for 12 other faults are presented in Table 5.1. The
fault variable numbers (1, 2, 3) represent the aeration rate, agitator power, and substrate
feed rate, respectively, as described in Chap. 4. Here FA is the number of false alarms
during the operation life.

It can be seen that the conventional sub-PCA method produces clearly more false alarms
than the proposed method. In comparison, the proposed method shows good robustness.
The false alarms here are caused by small changes in the initial state of the process.
In real situations the initial states usually differ, which changes the monitoring
model, and many false alarms result from these small changes. The conventional sub-PCA
method shows poor monitoring performance in some transition stages and may even fail to
detect these faults because of the inaccurate stage division.


5.4 Conclusions 


In a multi-stage batch process, the correlation between process variables changes as
the process moves from one stage to the next. This makes MPCA and traditional sub-PCA
methods inadequate for process monitoring and fault diagnosis. This chapter proposes a
new phase identification method that explicitly identifies stable and transitional phases.
Each phase usually has its own dynamic characteristics and deserves to be treated separately.
In particular, the transition phase between two stable phases has its own dynamic
transition characteristics and is difficult to identify.
Two techniques are adopted in this chapter to overcome the above problems.
Firstly, inaccurate phase delineation caused by fault data is avoided in the rough 
division by introducing sampling times in the loading matrix. Then, based on the 
distance of the process data to the center of the SVDD hypersphere, transition phases 
can be identified from nearby stable phases. Separate sub-PCA models are given for 
these stable and transitional phases. In particular, the soft transition sub-PCA model 
is a weighted sum of the previous stable stage, the current transition stage and the 
next stable stage. It can reflect the dynamic characteristic changes of the transition 
phase. 

Finally, the proposed method is applied to the penicillin fermentation process. 
The simulation results show the effectiveness of the proposed method. Furthermore, 
the method can be applied to the problem of monitoring any batch or semi-batch 
process for which detailed process information is not available. It is helpful when 
identifying the dynamic transitions of unknown batch or semi-batch processes. 
Chapter 6
Statistics Decomposition and Monitoring in Original Variable Space


The traditional process monitoring method first projects the measured process data
into the principal component subspace (PCS) and the residual subspace (RS), and then
calculates the T² and SPE statistics to detect abnormalities. However, these two statistics
detect abnormalities from the principal components of the process. Principal components
have no specific physical meaning and do not contribute directly to identifying the fault
variable and its root cause. Researchers have proposed many methods to identify the fault
variable accurately based on the projection space. The most popular is the contribution
plot, which measures the contribution of each process variable to the principal
components (Wang et al. 2017; Luo et al. 2017; Liu and Chen 2014). Moreover, in order to
determine the control limits of the two statistics, their probability distributions must
be estimated or assumed to take a specific form. Fault identification by these statistics
is not intuitive enough to directly reflect the role and trend of each variable when the
process changes.

In this chapter, direct monitoring in the original measurement space is investigated,
in which each of the two statistics is decomposed as a unique sum of the contributions
of the original process variables. Monitoring the original process variables is direct
and has an explicit physical meaning, but it is relatively complicated and time consuming
because each variable must be monitored in both the SPE and T² statistics. To address
this issue, a new combined index is proposed and interpreted geometrically, which is
different from other combined indices (Qin 2003; Alcala and Qin 2010). The proposed
combined index is an intrinsic method. Compared with the traditional latent space methods,
the combined index-based monitoring does not require a prior distribution assumption to
calculate the control limits. Thus, the monitoring complexity is greatly reduced.
6.1 Two Statistics Decomposition


According to the traditional PCA method, the process variables x can be divided
into two parts, the principal component part $\hat{x}$ and the residual e:

$$
x = tP^{T} + e = \hat{x} + e, \tag{6.1}
$$

where P is the matrix of loading vectors that define the latent variable space,
t is the score matrix that contains the coordinates of x in that space, and e is
the matrix of residuals. The T² and SPE statistics are used to measure the distance from
the new data to the model data. Generally, the T² and SPE statistics should be analyzed
simultaneously so that the cumulative effects of all variables can be utilized. However,
most of the literature has considered only the decomposition of T². Therefore, this
chapter also considers the decomposition of the SPE statistic, so that the original process
variables can be monitored in both the T² and the SPE statistical spaces.


6.1.1 T² Statistic Decomposition


The T² statistic can be reformulated as follows:

$$
T^{2} := D = t\Lambda^{-1}t^{T} = xP\Lambda^{-1}P^{T}x^{T} = xAx^{T}
= \sum_{i=1}^{J}\sum_{j=1}^{J} a_{i,j}\,x_{i}x_{j} \ge 0, \tag{6.2}
$$

where $A = P\Lambda^{-1}P^{T} \ge 0$, $\Lambda^{-1}$ is the inverse of the covariance matrix
estimated from a reference population, and $a_{i,j}$ is an element of matrix A.
One of the T² statistic decompositions (Birol et al. 2002) is given as follows:

$$
\begin{aligned}
D &= \sum_{k=1}^{J} \frac{a_{k,k}}{2}\left[\left(x_{k} - x_{k}^{*}\right)^{2} + \left(x_{k}^{2} - x_{k}^{*2}\right)\right] \\
  &= \sum_{k=1}^{J} a_{k,k}\left[x_{k}^{2} - x_{k}^{*}x_{k}\right] \\
  &= \sum_{k=1}^{J} c_{k}^{D},
\end{aligned} \tag{6.3}
$$

where $c_{k}^{D}$ is the decomposed T² statistic of each variable $x_{k}$. Next, the T² statistic
of each variable $x_{k}$ can be calculated as follows:

$$
c_{k}^{D} = a_{k,k}\left[x_{k}^{2} - x_{k}^{*}x_{k}\right]. \tag{6.4}
$$

The detailed T² statistic decomposition process is not shown here; details can
be found in Alvarez et al. (2007, 2010).


6.1.2 SPE Statistic Decomposition 


The SPE statistic, which reflects the change of the random quantity in the residual
subspace, also has a quadratic form:

$$
\mathrm{SPE} := Q = ee^{T} = x\left(I - PP^{T}\right)\left(I - PP^{T}\right)^{T}x^{T}
= xBx^{T} = \sum_{i=1}^{J}\sum_{j=1}^{J} b_{i,j}\,x_{i}x_{j}, \tag{6.5}
$$

where $B = (I - PP^{T})(I - PP^{T})^{T}$, $b_{i,j}$ is an element of matrix B, and $b_{i,j} = b_{j,i}$.
Similar to the decomposition of the T² statistic, the SPE statistic can also be decomposed
into a series of new statistics, one for each variable.

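Firstly, the SPE statistic Q can be reformulated in terms of a single variable $x_{k}$:

$$
Q = Q_{k} = b_{k,k}x_{k}^{2} + \left(2\sum_{j=1,\,j\neq k}^{J} b_{k,j}x_{j}\right)x_{k}
+ \sum_{i=1,\,i\neq k}^{J}\;\sum_{j=1,\,j\neq k}^{J} b_{i,j}\,x_{i}x_{j}. \tag{6.6}
$$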


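The minimum value of $Q_{k}$ can be calculated as

$$
\frac{\partial Q_{k}}{\partial x_{k}} = 2b_{k,k}x_{k}^{*} + 2\sum_{j=1,\,j\neq k}^{J} b_{k,j}x_{j} = 0
\;\Rightarrow\; x_{k}^{*} = -\sum_{j=1,\,j\neq k}^{J} b_{k,j}x_{j} \Big/ b_{k,k}, \tag{6.7}
$$

$$
Q_{k}^{\min} = -b_{k,k}x_{k}^{*2} + \sum_{i=1,\,i\neq k}^{J}\;\sum_{j=1,\,j\neq k}^{J} b_{i,j}\,x_{i}x_{j}. \tag{6.8}
$$

The difference between the SPE statistic Q and $Q_{k}^{\min}$ is

$$
Q - Q_{k}^{\min} = b_{k,k}\left(x_{k} - x_{k}^{*}\right)^{2}. \tag{6.9}
$$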


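The sum of the $Q_{k}^{\min}$ for k = 1, 2, ..., J is

$$
\begin{aligned}
\sum_{k=1}^{J} Q_{k}^{\min} &= -\sum_{k=1}^{J} b_{k,k}x_{k}^{*2}
  + \sum_{k=1}^{J}\;\sum_{i=1,\,i\neq k}^{J}\;\sum_{j=1,\,j\neq k}^{J} b_{i,j}\,x_{i}x_{j} \\
  &= (J - 2)\,Q + \sum_{k=1}^{J} b_{k,k}\left(x_{k}^{2} - x_{k}^{*2}\right).
\end{aligned} \tag{6.10}
$$

The SPE statistic obtained from (6.10) can be evaluated as the sum of the contributions
of each variable $x_{k}$:

$$
\begin{aligned}
Q &= \sum_{k=1}^{J} \frac{b_{k,k}}{2}\left[\left(x_{k} - x_{k}^{*}\right)^{2} + \left(x_{k}^{2} - x_{k}^{*2}\right)\right] \\
  &= \sum_{k=1}^{J} b_{k,k}\left[x_{k}^{2} - x_{k}^{*}x_{k}\right] \\
  &= \sum_{k=1}^{J} q_{k}^{SPE}.
\end{aligned} \tag{6.11}
$$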
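The step from (6.10) to (6.11) is not spelled out in the text. One way to fill it in, using only (6.9) and (6.10), is sketched below: sum (6.9) over all k, substitute (6.10) for the sum of the minima, and solve for Q.

```latex
\begin{aligned}
\sum_{k=1}^{J}\left(Q - Q_{k}^{\min}\right)
  &= \sum_{k=1}^{J} b_{k,k}\left(x_{k} - x_{k}^{*}\right)^{2}
  && \text{sum of (6.9) over } k \\
JQ - (J-2)\,Q - \sum_{k=1}^{J} b_{k,k}\left(x_{k}^{2} - x_{k}^{*2}\right)
  &= \sum_{k=1}^{J} b_{k,k}\left(x_{k} - x_{k}^{*}\right)^{2}
  && \text{insert (6.10)} \\
2Q &= \sum_{k=1}^{J} b_{k,k}\left[\left(x_{k} - x_{k}^{*}\right)^{2}
      + \left(x_{k}^{2} - x_{k}^{*2}\right)\right]
\end{aligned}
```

Dividing by 2 gives the first line of (6.11), and expanding the square gives $(x_{k} - x_{k}^{*})^{2} + x_{k}^{2} - x_{k}^{*2} = 2\left(x_{k}^{2} - x_{k}^{*}x_{k}\right)$, which is the second line.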


The contributions of the original process variables to the SPE statistic are used to
monitor the system status:

$$
q_{k}^{SPE} = b_{k,k}\left[x_{k}^{2} - x_{k}^{*}x_{k}\right]. \tag{6.12}
$$

So the novel SPE statistic can be evaluated as a unique sum of the contributions
$q_{k}^{SPE}$ (k = 1, 2, ..., J) of each variable, which is used for original process variable
monitoring.
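The decompositions (6.4) and (6.12) follow one pattern: for a symmetric matrix M (A for T², B for SPE), the contribution of variable k is $m_{k,k}(x_{k}^{2} - x_{k}^{*}x_{k})$ with $x_{k}^{*}$ as in (6.7). The following is a minimal Python sketch of that pattern, not the authors' implementation. The function names and the small matrix are hypothetical and chosen only for illustration, and the sketch assumes a non-zero diagonal of M.

```python
def variable_contributions(M, x):
    """Split the quadratic form x M x^T into one term per variable.

    M is a symmetric matrix (list of rows) with a non-zero diagonal,
    x is a single observation (list of floats).
    """
    J = len(x)
    contributions = []
    for k in range(J):
        # x_k^* minimises the quadratic form with respect to x_k, cf. (6.7)
        cross = sum(M[k][j] * x[j] for j in range(J) if j != k)
        x_star = -cross / M[k][k]
        # contribution of variable k, cf. (6.4) and (6.12)
        contributions.append(M[k][k] * (x[k] ** 2 - x_star * x[k]))
    return contributions


def quadratic_form(M, x):
    """Evaluate x M x^T directly, for comparison."""
    J = len(x)
    return sum(M[i][j] * x[i] * x[j] for i in range(J) for j in range(J))


# Hypothetical 3-variable example (values chosen only for illustration).
M = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 1.5]]
x = [1.0, -2.0, 0.5]

parts = variable_contributions(M, x)   # one contribution per variable
total = quadratic_form(M, x)           # the undecomposed statistic
```

In this example the per-variable terms add up to the full quadratic form, which is the "unique sum" property the text relies on. Note that an individual term can be negative even though the total is not.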


6.1.3 Fault Diagnosis in Original Variable Space 


Similar to other PCS monitoring strategies, the proposed original variable monitoring
technique consists of two stages that are executed offline and online. Firstly, in the
offline stage, the control limits of the two statistics (T² and SPE) for each time interval
are determined from a reference population of normal batches. Next, in the online stage,
the two statistics are calculated at each sampling time. If one of the statistics exceeds
the established control limit, then a faulty mode is declared.

The historical data of the batch process form a three-dimensional array
X(I × J × K), where I, J, and K are the numbers of batches, process variables, and
sampling times, respectively. The three-dimensional process data must be unfolded
into two-dimensional time-slice matrices $X_{k}$(I × J), k = 1, 2, ..., K, before performing
the PCA operation. Each unfolded matrix $X_{k}$ is normalized to zero mean and unit variance
in each variable. The main nonlinear and dynamic components of the variables are still
left in the scaled data matrix $X_{k}$.
The normalized data matrix $X_{k}$ is projected into the principal component subspace
by the loading matrix $P_{k}$ to obtain the score matrix $T_{k}$:

$$
X_{k} = T_{k}P_{k}^{T} + E_{k},
$$

where $E_{k}$ is the residual matrix. The two statistics associated with the ith batch for
the jth variable in the kth time interval are defined as $c^{D}_{i,j,k}$ and $q^{SPE}_{i,j,k}$.

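The control limit of a continuous process can be determined by using the kernel
density estimation (KDE) method. For batch processes, another method has been used,
in which the control limit is determined by the mean and variance of each statistic
(Yoo et al. 2004; Alvarez et al. 2007). The mean and variance of $c^{D}_{i,j,k}$
are calculated as follows:

$$
\bar{c}_{j,k} = \sum_{i=1}^{I} c^{D}_{i,j,k} \Big/ I, \qquad
\mathrm{var}\left(c_{j,k}\right) = \sum_{i=1}^{I}\left(c^{D}_{i,j,k} - \bar{c}_{j,k}\right)^{2} \Big/ (I - 1). \tag{6.13}
$$

The control limit of statistic $c^{D}_{i,j,k}$ is estimated as

$$
c^{limit}_{j,k} = \bar{c}_{j,k} + \lambda_{1}\left(\mathrm{var}\left(c_{j,k}\right)\right)^{1/2}, \tag{6.14}
$$

where $\lambda_{1}$ is a predefined parameter. Similarly, the control limit of statistic $q^{SPE}_{i,j,k}$ is

$$
q^{limit}_{j,k} = \bar{q}_{j,k} + \lambda_{2}\left(\mathrm{var}\left(q_{j,k}\right)\right)^{1/2}, \tag{6.15}
$$

where $\lambda_{2}$ is a predefined parameter,

$$
\bar{q}_{j,k} = \sum_{i=1}^{I} q^{SPE}_{i,j,k} \Big/ I, \qquad
\mathrm{var}\left(q_{j,k}\right) = \sum_{i=1}^{I}\left(q^{SPE}_{i,j,k} - \bar{q}_{j,k}\right)^{2} \Big/ (I - 1). \tag{6.16}
$$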

As shown above, the control limit calculation is very simple. Although the amount of
calculation increases, the extra calculations can be performed offline, so they place no
restriction on the online monitoring stage. The proposed monitoring technique, with its
offline and online stages, is summarized as follows:


A. Offline Stage


1. Obtain the normal process data X of I batches, unfold them into two-dimensional
time-slice matrices $X_{k}$, and then normalize the data.

2. Perform the PCA procedure on the normalized matrix $X_{k}$ of each time slice and
obtain the loading matrices $P_{k}$.

3. Calculate the statistics $c^{D}_{i,j,k}$ and $q^{SPE}_{i,j,k}$ of each variable at all time intervals
for all batches, i.e., the variable contributions at each time interval, using
(6.4) and (6.12).

4. Estimate the control limits of the statistics $c^{D}_{i,j,k}$ and $q^{SPE}_{i,j,k}$ with (6.14) and (6.15).


B. Online Stage


1. Collect the new time-slice data $X_{new}$, and then normalize them based on the mean
and variance of the I normal batches (modeling data).

2. Use $P_{k}$ to calculate the new statistics $c^{D}_{j,k}$ and $q^{SPE}_{j,k}$ of the new sample, and judge
whether these statistics exceed their control limits. If one of them does, fault
identification is performed to find the faulty variable, i.e., the variable whose
statistic exceeds its control limit by a much larger margin than the others; if none
of them does, the current data are normal.
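The limit calculation of the offline stage and the check of the online stage can be sketched for one variable j and one time slice k. This is a minimal Python illustration that assumes the per-batch statistics have already been computed with (6.4) or (6.12). The function names, the five sample values, and the setting λ = 3 are assumptions made for the example and do not come from the source.

```python
from statistics import mean, variance


def control_limit(values, lam):
    """Control limit as in (6.14)/(6.15): mean + lam * sqrt(sample variance).

    `values` holds one statistic (e.g. c_{i,j,k}) for all I normal batches
    at a fixed variable j and time slice k. `variance` divides by I - 1,
    matching (6.13) and (6.16).
    """
    return mean(values) + lam * variance(values) ** 0.5


def is_alarm(new_value, limit):
    """Online check: the statistic alarms when it exceeds its limit."""
    return new_value > limit


# Hypothetical statistics of five normal batches (illustrative values only).
c_normal = [0.8, 1.0, 1.2, 0.9, 1.1]
limit = control_limit(c_normal, lam=3.0)  # lam = 3 is an assumed setting

print(is_alarm(1.2, limit))  # a value inside the normal spread
print(is_alarm(2.0, limit))  # a value well above the reference batches
```

In a full implementation this limit would be stored for every variable and every time slice during the offline stage, so that the online stage only needs one comparison per statistic.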


6.2 Combined Index-Based Fault Diagnosis 


Monitoring in the original process variables avoids some of the disadvantages of the
traditional statistical approach in the latent variable space, such as indirect
monitoring (Yoo et al. 2004). However, the original variable monitoring method is
relatively complicated because each variable is monitored in both the SPE and T²
statistics. Each variable must therefore be monitored twice, which increases the amount
of calculation. Thus, a new combined index, composed of the SPE and T² statistics, is
proposed to decrease the monitoring complexity.


6.2.1 Combined Index Design 


In this section, we use the symbol X(I × J) in place of the unfolded process data
matrix $X_{k}$(I × J) for general analysis. Similarly, $P_{k}$, $T_{k}$, $E_{k}$ are replaced by
P, T, E. The process data X can be decomposed into the PCS and the RS when performing
PCA:

$$
X = TP^{T} + E = \hat{X} + E, \tag{6.17}
$$

where $\hat{X}$ is the PCS part and E is the RS part. If the number of principal components
is m, then a PCS of dimension m and an RS of dimension (J − m) are obtained. When new
data x are measured, they are projected into the principal subspace:

$$
t = xP. \tag{6.18}
$$
Fig. 6.1 Graphical representation of T² and SPE statistics


The principal component (PC) score vector t (1 × m) is the projection of the new data
x in the PCS. Subsequently, the PC score vector is projected back into the original
process variables to estimate the process data $\hat{x} = tP^{T}$. The residual vector e is
defined as

$$
e = x - \hat{x} = x\left(I - PP^{T}\right). \tag{6.19}
$$


The residual vector e reflects the difference between the new data x and the modeling
data X in the RS. A graphical interpretation of the T² and SPE statistics is shown in Fig. 6.1.

To describe the statistics clearly in geometric terms, the principal component subspace
is taken as a hyperplane. The SPE statistic checks the model validity by measuring
the distance between the data in the original process variables and its projection onto
the model plane. Generally, the T² statistic is described by the Mahalanobis distance
from the projected point t to the projection center of the normal process data, which
checks whether the new observation is projected within the limits of normal operation.
The residual space is perpendicular to the principal hyperplane. The SPE statistic shows
the distance from the new data x to the principal component hyperplane.

A new distance index φ from the new data to the principal component projection
center of the modeling data is given in the following. It can be used for monitoring
instead of the SPE and T² indicators. Consider the singular value decomposition
(SVD) of the covariance matrix $R_{x} = E\left(x^{T}x\right)$ for given normal data X,

$$
R_{x} = U\Lambda U^{T},
$$

where $\Lambda = \mathrm{diag}\{\lambda_{1}, \lambda_{2}, \ldots, \lambda_{m}, 0_{J-m}\}$ contains the eigenvalues of $R_{x}$.
The original loading matrix $U_{J\times J}$ is a unitary matrix and $UU^{T} = I$. The columns of
the unitary matrix form an orthonormal basis of the space they span. The basis vectors of
the principal component space and the residual space obtained by partitioning U are
orthogonal to each other. Furthermore,

$$
U = [P, P_{e}], \tag{6.20}
$$

where $P \in R^{J\times m}$ is the loading matrix. $P_{e} \in R^{J\times (J-m)}$ can be treated as the loading
matrix of the residual space. Thus, P and $P_{e}$ are represented by U as follows:

$$
P = UF_{1}, \qquad P_{e} = UF_{2}, \tag{6.21}
$$


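where

$$
F_{1} = \begin{bmatrix} I_{m} \\ 0_{J-m} \end{bmatrix}_{J\times m}, \qquad
F_{2} = \begin{bmatrix} 0_{m} \\ I_{J-m} \end{bmatrix}_{J\times (J-m)}, \tag{6.22}
$$

where $I_{m}$ and $I_{J-m}$ are the m and J − m dimension identity matrices, respectively, and
$0_{m}$ and $0_{J-m}$ are the m and J − m dimension zero matrices, respectively. Furthermore,
the SPE and T² statistics are expressed in terms of U: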


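$$
\begin{aligned}
e &= x\left(I - PP^{T}\right) = x\left(UU^{T} - UF_{1}F_{1}^{T}U^{T}\right) \\
  &= x\left(UU^{T} - UE_{1}U^{T}\right) = xU\left(I - E_{1}\right)U^{T} = xUE_{2}U^{T},
\end{aligned} \tag{6.23}
$$

where

$$
E_{1} = \begin{bmatrix} I_{m} & 0_{m\times (J-m)} \\ 0_{(J-m)\times m} & 0_{J-m} \end{bmatrix}, \qquad
E_{2} = \begin{bmatrix} 0_{m} & 0_{m\times (J-m)} \\ 0_{(J-m)\times m} & I_{J-m} \end{bmatrix}. \tag{6.24}
$$

Define y = xU, then

$$
\mathrm{SPE} := Q = ee^{T} = xUE_{2}U^{T}UE_{2}U^{T}x^{T}
= xUE_{2}U^{T}x^{T} = yE_{2}y^{T} = \sum_{i=m+1}^{J} y_{i}^{2}. \tag{6.25}
$$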
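Similarly, we can describe the T² statistic as follows:

$$
\begin{aligned}
T^{2} := D &= t\Lambda_{m}^{-1}t^{T} = xP\Lambda_{m}^{-1}P^{T}x^{T} \\
  &= xUF_{1}\Lambda_{m}^{-1}F_{1}^{T}U^{T}x^{T} = xU\Lambda^{-1}U^{T}x^{T} \\
  &= y\Lambda^{-1}y^{T} = \sum_{i=1}^{m} y_{i}^{2}\sigma_{i}^{-2},
\end{aligned} \tag{6.26}
$$

where

$$
\Lambda_{m} = \mathrm{diag}\{\sigma_{1}^{2}, \sigma_{2}^{2}, \ldots, \sigma_{m}^{2}\}, \qquad
\Lambda^{-1} = \mathrm{diag}\{\Lambda_{m}^{-1}, 0_{(J-m)\times (J-m)}\}.
$$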
The new combined index can be obtained directly by combining the two statistics as

$$
\varphi = D + Q = \sum_{i=1}^{m} y_{i}^{2}\sigma_{i}^{-2} + \sum_{i=m+1}^{J} y_{i}^{2}. \tag{6.27}
$$

The derivation above shows that the two decomposed statistics can be added together
directly in a geometric sense. This result demonstrates that the T² and SPE statistics
can be combined in a natural way and that this is an intrinsic property. Thus, the
proposed combined index is a more general and geometric representation than the other
combined indices. The monitoring strategy with the novel index is introduced in the
next subsection.
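The identity (6.27) can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: J = 3, m = 2, an orthogonal matrix U built by hand from a rotation instead of an eigendecomposition of real process data, and arbitrary values for the variances and the observation. T² and SPE are computed the traditional way (scores and residual), and φ is computed from y = xU alone.

```python
import math

J, m = 3, 2
theta = math.pi / 6  # arbitrary rotation angle, for illustration only

# Orthogonal matrix U = [P, P_e]; its columns play the role of eigenvectors.
U = [[math.cos(theta), -math.sin(theta), 0.0],
     [math.sin(theta),  math.cos(theta), 0.0],
     [0.0,              0.0,             1.0]]
P = [row[:m] for row in U]   # loading matrix: first m columns of U
sigma2 = [4.0, 1.0]          # assumed variances of the m retained components
x = [1.0, 2.0, 0.5]          # one (already normalized) observation

# Traditional route: scores t = xP, then T^2 (D) and SPE (Q)
t = [sum(x[j] * P[j][i] for j in range(J)) for i in range(m)]
D = sum(t[i] ** 2 / sigma2[i] for i in range(m))
x_hat = [sum(t[i] * P[j][i] for i in range(m)) for j in range(J)]
Q = sum((x[j] - x_hat[j]) ** 2 for j in range(J))

# Route of (6.25)-(6.27): everything expressed through y = xU
y = [sum(x[j] * U[j][i] for j in range(J)) for i in range(J)]
phi = (sum(y[i] ** 2 / sigma2[i] for i in range(m))
       + sum(y[i] ** 2 for i in range(m, J)))

print(abs((D + Q) - phi) < 1e-12)
```

Both routes agree up to floating-point error in this example, because the first m coordinates of y are the scores and the remaining coordinates carry the residual.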
6.2.2 Control Limit of Combined Index


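In Sect. 6.1, the T² and SPE statistics are decomposed into two new statistics for each
variable. To reduce the calculation of process monitoring, the two new statistics are
combined into a new statistic φ to monitor the process:

$$
\varphi_{i,j,k} = c^{D}_{i,j,k} + q^{SPE}_{i,j,k}, \tag{6.28}
$$

where $\varphi_{i,j,k}$ is the combined statistic of the ith batch at sampling time k for the
jth variable. The method mentioned in Sect. 6.1.3 can be used to calculate the control
limit of the new statistic,

$$
\varphi^{limit}_{j,k} = \bar{\varphi}_{j,k} + \kappa\left(\mathrm{var}\left(\varphi_{j,k}\right)\right)^{1/2}, \tag{6.29}
$$

where κ is a predefined parameter, and

$$
\bar{\varphi}_{j,k} = \sum_{i=1}^{I} \varphi_{i,j,k} \Big/ I, \qquad
\mathrm{var}\left(\varphi_{j,k}\right) = \sum_{i=1}^{I}\left(\varphi_{i,j,k} - \bar{\varphi}_{j,k}\right)^{2} \Big/ (I - 1). \tag{6.30}
$$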


Online process monitoring can be performed by comparing the new statistic with its
control limit. Several points should be noted when the proposed control limit is used.
Firstly, the mean and variance may be inaccurate for a small number of samples, so a
sufficient number of training samples should be collected during the offline stage.
Secondly, the predefined parameter κ is important and is set by the engineers according
to the actual process conditions. The tuning of κ is similar to that of the Shewhart
control chart. Equation (6.29) shows that the effect of the variance depends on κ, and
so does the fluctuation of the control limit from sample to sample. For example, the
control limit is smooth when a smaller κ is selected, and it fluctuates when a larger κ
is selected.

If the combined statistic of the new sample differs significantly from those
of the reference data set, then a fault is detected. A fault isolation procedure is then
carried out to find the fault roots. This fault response process is one of the advantages
of original process variable monitoring, as each variable has a unique formulation and
physical meaning. The proposed monitoring steps are similar to those in Sect. 6.1.3.
6.3 Case Study


A fed-batch penicillin fermentation process is considered in the case study, and its
detailed mathematical model is given in Birol et al. (2002). A detailed description
of the fed-batch penicillin fermentation process is available in Chap. 4.


6.3.1 Variable Monitoring via Two Statistics Decomposition 


Firstly, the original process variable monitoring algorithm described in Sect. 6.1.3
is tested. Showing the monitoring results of all variables would be tedious,
so only several typical variables are shown here for demonstration or comparison.
The monitoring result of variable 1 in a normal test batch is shown in Fig. 6.2. Neither
of the two statistics ($c^{D}_{1,k}$ and $q^{SPE}_{1,k}$) exceeds its control limit, and the statistics
($c^{D}_{j,k}$ and $q^{SPE}_{j,k}$, j = 2, ..., 11) of all the other variables do not exceed their
control limits either. The monitoring results of the other variables are similar to that
of variable 1, so they are omitted for brevity. These results show that the
proposed algorithm does not produce false alarms when it is used to monitor the normal
batch.

Next, faulty batch data are used to test the proposed monitoring algorithm of
the original process variables, and two types of faults are chosen here.

Fault 1: step type, e.g., a 20% step decrease is added in variable 3 at 200-250 h.
Fig. 6.2 Original variables monitoring for normal batch (variable 1)
Fig. 6.3 Monitoring result for Fault 1 (variable 1)


The monitoring results are as follows. Figure 6.3 shows the monitoring
result of variable 1 for Fault 1; the statistics change noticeably while the fault
is present. However, the statistics do not exceed the control limit, i.e., the process
status changes, but variable 1 is not the fault source. The monitoring results
of variables 2, 4, 8, 9, and 11 are almost the same as that of variable 1 and
are not presented here.

The monitoring results of variables 3 and 5 are shown in Figs. 6.4 and 6.5,
respectively. The statistics of both variables exceed the control limit at the sampling
time 200 h. The statistics of variables 6, 7, and 10 also exceed the control limit,
and the simulation results of these variables are nearly the same as that of
variable 5 (the results are not presented here).

The question is: which variable is the fault source, variable 3, 5, or another one? From
the amplitudes in Figs. 6.4 and 6.5, it is easy to see that the two statistics of variable
3 exceed their control limits to a much greater extent than those of variable 5 and
the other variables. In particular, the Q statistic of variable 3 is 40 times greater than
its control limit. From this perspective, variable 3 can be concluded to be the fault
source, as it contributes most visibly to the statistics. Note that there is no
smearing effect in the proposed method. The smearing effect means that non-faulty
variables exhibit larger contribution values, while the contribution of faulty variables
is smaller. Because the statistics are decomposed into a unique sum of the variable
contributions, each monitoring figure is plotted against the decomposed variable
statistics. Furthermore, the proposed method may identify several faulty variables if
they have large contributions of similar magnitude.
Fig. 6.4 Monitoring result for Fault 1 (variable 3)
Fig. 6.5 Monitoring result for Fault 1 (variable 5)
Fig. 6.6 Relative contribution rate $R^{D}$ for Fault 1


To confirm the monitoring conclusion, the relative statistical contribution rate of
the jth variable at time k is defined as

$$
R^{D}_{j,k} = c^{D}_{j,k} \Big/ \sum_{j=1}^{J} c^{D}_{j,k}, \qquad
R^{SPE}_{j,k} = q^{SPE}_{j,k} \Big/ \sum_{j=1}^{J} q^{SPE}_{j,k}.
$$


The relative statistical contribution rates of the 11 variables are shown in Figs. 6.6
and 6.7. It is clear that variable 3 is the source of Fault 1. Variables 9, 10, and 11
still have higher contributions after the fault is eliminated because the fault in
variable 3 changes the other process variables. The effects on the whole process
continue even after the fault is eliminated, and the fault propagates from the
original variable 3 to other process variables.
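The relative contribution rate defined above is a simple normalization at each sampling time. A minimal Python sketch follows, with hypothetical contribution values and an illustrative function name; it assumes the total over all variables is non-zero.

```python
def relative_contributions(values):
    """Share of each variable in the total statistic at one sampling time."""
    total = sum(values)
    return [v / total for v in values]


# Hypothetical per-variable contributions c_{j,k} at one sampling time k.
c_at_k = [0.2, 0.1, 4.5, 0.2]
rates = relative_contributions(c_at_k)

# Variable with the largest share (1-based numbering, as in the text).
suspect = rates.index(max(rates)) + 1
print(suspect)
```

Repeating this for every sampling time gives one column of the contribution map per time step, which is how a plot of the kind referred to in the text could be assembled.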

Fault 2: ramp type, i.e., a fault involving a ramp increase with a slope of 0.3 in
variable 3 at 20-80 h.

The two monitoring statistics of variable 3 are shown in Figs. 6.8 and 6.9. It can be
seen that both statistics exceed their control limits at approximately 50 h.
The alarm time lags behind the fault occurrence time (approximately 20 h)
because this fault variable changes gradually. When the fault is eliminated after
80 h, the relationship among the variables changes back to normal. The T² statistic
clearly falls below the control limit, while the SPE statistic still exceeds the
control limit because the error caused by Fault 2 still exists.
Fig. 6.7 Relative contribution rate $R^{SPE}$ for Fault 1
Fig. 6.8 Fault 2 monitoring by c statistic (variable 3)
Fig. 6.9 Fault 2 monitoring by q statistic (variable 3)


6.3.2 Combined Index-Based Monitoring 


The same test data as in Sect. 6.3.1 are used to test the monitoring effectiveness of
the new combined index. For a normal batch, the monitoring result of the φ statistic is
shown in Fig. 6.10. Variable 1 is monitored again in this section, as in
Sect. 6.3.1, for comparison. The new index φ of variable 1 is far below
its control limit, as are the new index values of the other variables. The method
performs well: the number of false alarms is zero in normal batch monitoring. The new
index is more stable than the two separate statistics and is easy for operators to observe.

Fault 1: step type, e.g., a 20% step decrease is added in variable 3 at 200-250 h.

The new statistic φ of variable 1 does not exceed the control limit in Fig. 6.11,
although it changes from 200 h to 250 h during the fault. The values of the new statistic
φ of variables 2, 4, 8, 9, and 11 also do not exceed the control limit; the
corresponding monitoring plots are omitted here. Thus, these variables have no direct
relationship with the fault variable, i.e., they are not the fault source.

Furthermore, the monitoring results of variables 3 and 5 are shown in Figs. 6.12
and 6.13, respectively. The statistics of variables 3 and 5 clearly exceed their control
limits, as do those of variables 6, 7, and 10. As discussed in Sect. 6.3.1,
the statistic φ of variable 3 changes to a greater extent than those of the other
variables, so variable 3 is the potential fault source. This result shows that the proposed
approach is an efficient technique for fault detection.
Fig. 6.10 Original variables monitoring based on combined index for normal batch (variable 1)
Fig. 6.11 Fault 1 monitoring based on combined index (variable 1)
Fig. 6.12 Fault 1 monitoring based on combined index (variable 3)
Fig. 6.13 Fault 1 monitoring based on combined index (variable 5)
The relative contribution of the new statistic is used to confirm the fault source,
which is defined as

$$
R^{\varphi}_{j,k} = \varphi_{j,k} \Big/ \sum_{j=1}^{J} \varphi_{j,k}.
$$


The relative contribution of variable 3 is nearly 100%, as shown in Fig. 6.14, so
variable 3 is confirmed as the fault source. Variables 9, 10, and 11 still
have higher contributions after the fault is eliminated because the fault in variable
3 changes the other process variables, and its effect on the whole process
continues even after the fault is eliminated.

Note that the relative contribution plot (RCP) is an auxiliary tool to locate the
fault roots. It is only used for comparison with the proposed monitoring method to
confirm diagnostic conclusions. Furthermore, the RCP in this work is completely different
from the traditional contribution plot. It is calculated from the original process
variables, i.e., the RCP has no smearing effect. The contribution of each variable is
independent of the other variables. Therefore, the proposed method is a novel and helpful
approach to original process variable monitoring. Furthermore, the color map of the
fault contribution is intuitive. As a result, the map helps operators to find the fault
source, and engineers can obtain useful information to avoid more serious accidents.

Fault 2: ramp type, i.e., a fault involving a ramp increase with a slope of 0.3 in
variable 3 at 20-80 h.
Fig. 6.14 Relative contribution rate of φ statistic for Fault 1
Fig. 6.15 Fault 2 monitoring of variable 3 by φ statistic


The monitoring result of variable 3 is shown in Fig. 6.15. It can be seen that the new
statistic φ exceeds the control limit at approximately 50 h and falls below the
control limit after 80 h. The result shows that the combined index can detect different
types of faults.


6.3.3 Comparative Analysis 


The monitoring performances of different methods are compared. Several performance
indices are used to evaluate the monitoring efficiency. False alarm (FA) is
the number of false alarms during the operation life. Time detected (TD) is the
time at which the statistic exceeds the control limit under faulty operation, which
represents the sensitivity.
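The two indices can be computed from a monitored statistic and its control limit. The Python sketch below rests on one assumption that the text does not state explicitly: alarms raised before the fault is introduced are counted as false alarms, and TD is the first alarm at or after the fault start. The function name and the sample data are hypothetical.

```python
def false_alarms_and_detection_time(times, stat, limit, fault_start):
    """Return (FA, TD) for one monitored batch.

    Assumption of this sketch: an alarm before `fault_start` counts as a
    false alarm, and TD is the first alarm at or after `fault_start`
    (None if the limit is never exceeded there).
    """
    fa = 0
    td = None
    for time, value, lim in zip(times, stat, limit):
        if value <= lim:
            continue  # no alarm at this sample
        if time < fault_start:
            fa += 1
        elif td is None:
            td = time
    return fa, td


# Hypothetical hourly samples with a constant control limit of 1.0.
times = list(range(10))
stat = [0.5, 1.2, 0.4, 0.6, 0.7, 0.9, 1.5, 1.8, 2.0, 2.2]
limit = [1.0] * len(times)

fa, td = false_alarms_and_detection_time(times, stat, limit, fault_start=5)
print(fa, td)
```

Here the spike at t = 1 is counted as one false alarm, and the fault introduced at t = 5 is first flagged at t = 6.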

The monitoring results of the proposed method are compared with those of the
traditional sub-PCA method (Lu et al. 2004) in latent space and the soft-transition
sub-PCA (Wang et al. 2013) to illustrate its effectiveness. The FA and TD results
for 12 other faults are presented in Tables 6.1 and 6.2, respectively. Fault variable
numbers (1, 2, and 3) represent the aeration rate, agitator power, and substrate feed
rate, as described in Chap. 4. The fault type and occurrence time for the variables are
given in Table 6.1, and the input conditions are the same as those in Sects. 6.3.1
and 6.3.2.
Table 6.1 Monitoring results of FA for other faults

Fault | Var. | Fault | M/S  | Fault    | Original variables | Trad. sub-PCA    | Soft sub-PCA
ID    | No.  | type  | (%)  | time (h) | monitoring         | (Lu et al. 2004) | (Wang et al. 2013)
      |      |       |      |          | c    | q    | φ    | FA               | FA
1     | 2    | Step  | -15  | 20       | 0    | 0    | 0    | 9                | 0
2     | 2    | Step  | -15  | 100      | 0    | 173  | 0    | 1                | 0
3     | 3    | Step  | -10  | 190      | 0    | 95   | 0    | 11               | 0
4     | 3    | Step  | -10  | 30       | 0    | 57   | 0    | 5                | 0
5     | 1    | Step  | -10  | 20       | 0    | 0    | 0    | 1                | 0
6     | 1    | Step  | -10  | 150      | 16   | 0    | 0    | 2                | 0
7     | 1    | Ramp  | -5   | 20       | 2    | 1    | 0    | 1                | 0
8     |      | Ramp  | -20  | 20       | 4    | 0    | 0    | 6                | 0
9     |      | Ramp  | -10  | 20       | 2    | 0    | 1    | 10               | 0
10    | 3    | Ramp  | -0.2 | 170      | 1    | 0    | 0    | 3                | 0
11    | 2    | Ramp  | -20  | 170      | 4    | 0    | 0    | 1                | 0
12    | 1    | Ramp  | -10  | 180      | 2    | 0    | 0    | 2                | 0


It can be seen from Table 6.1 that the traditional sub-PCA method produces multiple
false alarms when detecting faults, while the original process variable monitoring
method based on the combined index φ in this chapter produces fewer false alarms.
Among the three indices of the original space monitoring, the c and q statistics may
have a large number of false alarms for different reasons, but the new combined index
φ is more accurate because it balances the two indices.

Table 6.2 indicates that the original process variable monitoring gives more accurate
and timely detection results than the other two detection methods. The detection
delay is more than 10 h for Faults 4, 7, 8, and 11 in the traditional sub-PCA and
the soft-transition sub-PCA. Such a delay is unacceptable in a complex industrial
process. In contrast, the difference between the detected time and the real fault time
for the proposed approach is less than 10 h, except for Fault 4. This result is helpful
and meaningful in practice. As a result, the proposed approach could provide more
suitable process information to operators. Thus, the proposed monitoring method
based on a combined index shows the advantages of rapid detection and fewer false
alarms compared with the traditional or soft-transition sub-PCA approaches, which
monitor in the latent space rather than in the original measurement space.
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Table 6.2 Comparing the time of fault detected 


Fault | Fault Original process Trad. sub-PCA Soft-trans. sub-PCA 
ID time variables monitoring (Lu et al. 2004) (Wang et al. 2013) 
(h) 
c q p SPE |T? SPE |T? 

1 20 20 20 20 20 None 20 28 

2 100 100 100 100 100 101 100 100 

3 190 191 190 190 190 213 190 199 

4 30 45 45 45 81 45 48 45 

5 20 20 20 20 20 48 20 20 

6 150 151 150 150 150 151 150 151 

7 20 27 26 25 28 41 28 40 

8 20 30 26 26 44 34 31 45 

9 20 24 22 23 21 28 24 30 

10 170 171 170 170 170 173 171 171 

11 170 179 175 175 177 236 181 195 

12 180 184 182 182 185 185 184 188 
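
The delays discussed for Table 6.2 can be recomputed from the tabulated detection times. The following is a minimal sketch, not part of the original study: the numbers are transcribed from Table 6.2, the dictionary keys are illustrative labels, and the 10 h threshold is the one quoted in the text.

```python
# Detection times (h) transcribed from Table 6.2; None means "not detected".
fault_time = [20, 100, 190, 30, 20, 150, 20, 20, 20, 170, 170, 180]
detected = {
    "combined index": [20, 100, 190, 45, 20, 150, 25, 26, 23, 170, 175, 182],
    "trad. SPE": [20, 100, 190, 81, 20, 150, 28, 44, 21, 170, 177, 185],
    "trad. T2": [None, 101, 213, 45, 48, 151, 41, 34, 28, 173, 236, 185],
    "soft SPE": [20, 100, 190, 48, 20, 150, 28, 31, 24, 171, 181, 184],
    "soft T2": [28, 100, 199, 45, 20, 151, 40, 45, 30, 171, 195, 188],
}


def delays(times):
    """Detection delay per fault (detected time minus fault time)."""
    return [None if t is None else t - t0 for t, t0 in zip(times, fault_time)]


def slow_faults(times, threshold=10):
    """1-based fault IDs that are missed or detected at least `threshold` h late."""
    return [i + 1 for i, d in enumerate(delays(times))
            if d is None or d >= threshold]


for name, times in detected.items():
    print(name, slow_faults(times))
```

For the combined index only Fault 4 reaches the 10 h threshold, which matches the statement in the text.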


6.4 Conclusions 


A new multivariate statistical method for the monitoring and diagnosis of batch
processes, which operates on the original process variables, was presented in this
chapter. The proposed monitoring method is based on the decomposition of the $T^2$
and SPE statistics as a unique sum of the contributions of each variable. However,
problems may arise if the number of variables is large when the original process
variables technique is applied. To reduce the workload of the monitoring calculation,
a new combined index was proposed. A mathematical illustration was given to prove
that the two decomposed statistics can be added together directly. Compared with the
traditional PCA method in latent space, the proposed method is more direct, and only
one statistical index is used, thereby decreasing the calculation burden.

The new original variable space monitoring method can detect a fault with a clear
result based on each variable. The fault source can be determined directly from the
statistical index rather than from the traditional contribution plot. Furthermore, the
control limit of the new combined statistic is very simple and does not require the
statistic to follow any particular probability distribution. The simulation results show
that the new combined statistic can detect faults efficiently. Because the new
statistical index combines two decomposed statistics, it avoids many problems
introduced by the use of a single statistic, such as false alarms or missed alarms.


Chapter 7
Kernel Fisher Envelope Surface for
Pattern Recognition


Batch processes are more difficult to monitor than continuous processes because of
their complex features, such as nonlinearity, nonstationary operation, unequal
production cycles, and the fact that most variables are measured only at the end of a
batch. Traditional methods for batch processes, such as multiway FDA (Chen 2004)
and multi-model FDA (He et al. 2005), cannot handle these issues well. They require
complete batch data, which are available only at the end of a batch. Therefore, the
complete batch trajectory must be estimated in real time, or alternatively only the
values measured at the current moment are used for online diagnosis. Moreover,
these approaches do not consider the problem of inconsistent production cycles.

To address these issues, this chapter presents the modeling of the kernel Fisher
envelope surface (KFES) and applies it to the fault identification of batch processes.
This method builds separate envelope models for the normal and faulty data based on
the eigenvalues projected onto the two discriminant vectors of kernel FDA. The
highlights of the proposed method include kernel projection to handle nonlinearity,
batch-wise data unfolding, envelope modeling to handle unequal cycles, and a new
detection indicator that is easy to implement online.


7.1 Process Monitoring Based on Kernel Fisher Envelope 
Analysis 


7.1.1 Kernel Fisher Envelope Surface 


Consider the batch-wise data matrix with $I$ batches, i.e.,

$$X(k) = [X^1(k), X^2(k), \ldots, X^I(k)]^T,$$

where $X^i$ consists of $n_i$ $(i = 1, \ldots, I)$ row vectors and each row vector is a
sample vector $x_j^i(k)$, $j = 1, \ldots, n_i$, acquired at time $k$ in batch $i$. Each
batch has the same sampling period but a different operation cycle, i.e., batch $i$ has
$n_i$ $(i = 1, 2, \ldots, I)$ sampling points. Let $K$ be the largest sampling moment
among all batches, i.e., $K = \max[n_1, n_2, \ldots, n_I]$.

Let $\Phi(x)$ be a nonlinear mapping that maps the sample data from the original
space $X$ into the high-dimensional feature space $F$. If each batch is treated as a
class, the whole data set is divided into $I$ classes. The optimal discriminant vector
$w$ is obtained using the exponential criterion function in the feature space $F$.
Since computing $\Phi(x)$ explicitly is not always feasible, a kernel function is
introduced:

$$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_i)^T \Phi(x_j). \tag{7.1}$$

The kernel function allows the dot product in $F$ to be evaluated without computing
$\Phi$ directly. According to the theory of reproducing kernels, any solution
$w \in F$ for the discriminant vector must lie in the span of the mapped training
samples:

$$w = \sum_{i=1}^{n} \alpha_i \Phi(x_i) = \Phi \alpha, \tag{7.2}$$

where $x_m$, $m = 1, \ldots, n$, $n = n_1 + n_2 + \cdots + n_I$, is a row vector of
$X$, $\Phi = [\Phi(x_1), \ldots, \Phi(x_n)]$, and
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$. The eigenvalues $T_j^i$ are
obtained by projecting the mapped samples $\Phi(x_j^i)$ in the space $F$ onto $w$:

$$\begin{aligned}
T_j^i &= w^T \Phi(x_j^i) = \alpha^T \Phi^T \Phi(x_j^i) \\
      &= \alpha^T [\Phi(x_1)^T \Phi(x_j^i), \ldots, \Phi(x_n)^T \Phi(x_j^i)]^T \\
      &= \alpha^T \xi_j^i.
\end{aligned} \tag{7.3}$$

The kernel sample vector $\xi_j^i$ is defined as follows:

$$\xi_j^i = (K(x_1, x_j^i), K(x_2, x_j^i), \ldots, K(x_n, x_j^i))^T. \tag{7.4}$$
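
The kernel sample vector and the projection in (7.3) and (7.4) can be sketched in a few lines. This is only an illustration under stated assumptions: the Gaussian kernel with width `sigma` is one common choice (the chapter mentions a Gaussian kernel later), and the training rows and the coefficient vector `alpha` are placeholder values, not results from the book.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def kernel_sample_vector(X_train, x, sigma=1.0):
    """Kernel sample vector of Eq. (7.4): K(x_m, x) for every training row x_m."""
    return np.array([gaussian_kernel(x_m, x, sigma) for x_m in X_train])


def project(alpha, X_train, x, sigma=1.0):
    """Projection of Eq. (7.3): T = alpha^T xi, without an explicit feature map."""
    return float(alpha @ kernel_sample_vector(X_train, x, sigma))


# Placeholder training rows and coefficients, for illustration only.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([0.5, -0.2, 0.1])

xi = kernel_sample_vector(X_train, X_train[0])
far = project(alpha, X_train, np.array([50.0, 50.0]))
```

A sample far from all training rows gives a kernel sample vector close to zero, so its projection `far` is close to zero as well.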


Consider the projection of the within-class mean vector $m_i^{\Phi}$,
$i = 1, \ldots, I$. The kernel within-class mean vector $\mu_i$ is obtained as

$$\mu_i = \left[ \frac{1}{n_i} \sum_{j=1}^{n_i} K(x_1, x_j^i), \ldots,
\frac{1}{n_i} \sum_{j=1}^{n_i} K(x_n, x_j^i) \right]^T. \tag{7.5}$$

Then the kernel between-class scatter matrix $K_b$ is

$$K_b = \sum_{i=1}^{I} (\mu_i - \mu_0)(\mu_i - \mu_0)^T. \tag{7.6}$$

Similarly, considering the projection of the overall mean vector $m_0^{\Phi}$ onto the
discriminant vector $w$, the kernel overall mean vector $\mu_0$ and the within-class
scatter matrix $K_w$ are calculated as

$$\mu_0 = \left[ \frac{1}{n} \sum_{j=1}^{n} K(x_1, x_j), \ldots,
\frac{1}{n} \sum_{j=1}^{n} K(x_n, x_j) \right]^T, \tag{7.7}$$

$$K_w = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{n_i}
(\xi_j^i - \mu_i)(\xi_j^i - \mu_i)^T. \tag{7.8}$$


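The discriminant function, whose objective is to maximize the between-class scatter
and minimize the within-class scatter, is equivalent to

$$\max J(\alpha) = \frac{\mathrm{tr}(\alpha^T K_b \alpha)}{\mathrm{tr}(\alpha^T K_w \alpha)}
= \frac{\mathrm{tr}(\alpha^T (V_b \Lambda_b V_b^T) \alpha)}
       {\mathrm{tr}(\alpha^T (V_w \Lambda_w V_w^T) \alpha)}, \tag{7.9}$$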


where $K_b = V_b \Lambda_b V_b^T$ and $K_w = V_w \Lambda_w V_w^T$ are the eigenvalue
decompositions of the between-class and within-class scatter matrices, respectively.
To construct the envelope surface model, two discriminant vectors are usually used,
namely, the optimal discriminant vector and the suboptimal discriminant vector. The
kernel sampling vector for sampling point $k$ of batch $i$ is $\xi_k^i$, which is
projected onto the two discriminant vectors to obtain the eigenvalues $T_{i,k}^1$ and
$T_{i,k}^2$.
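
The construction of the kernel scatter matrices and the two discriminant vectors can be sketched as follows. This is a hedged illustration, not the authors' implementation: the Gaussian kernel, the ridge term `reg` that keeps the within-class matrix invertible, and solving the trace ratio through the eigenvectors of the regularized product are assumptions made here, and the three toy clusters are synthetic.

```python
import numpy as np


def rbf_gram(X, sigma=1.0):
    """Gram matrix G[m, j] = K(x_m, x_j) for a Gaussian kernel (assumed choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def kernel_scatter(X, labels, sigma=1.0):
    """Between-class and within-class kernel scatter matrices, Eqs. (7.5)-(7.8)."""
    n = X.shape[0]
    G = rbf_gram(X, sigma)            # column j is the kernel sample vector of x_j
    mu0 = G.mean(axis=1)              # kernel overall mean vector
    Kb = np.zeros((n, n))
    Kw = np.zeros((n, n))
    for c in np.unique(labels):
        cols = G[:, labels == c]      # kernel sample vectors of class c
        mu = cols.mean(axis=1)        # kernel within-class mean vector
        Kb += np.outer(mu - mu0, mu - mu0)
        centered = cols - mu[:, None]
        Kw += centered @ centered.T / n
    return Kb, Kw


def discriminant_vectors(Kb, Kw, reg=1e-6, n_dirs=2):
    """Leading eigenvectors of (Kw + reg*I)^{-1} Kb; `reg` is an assumed ridge term."""
    n = Kb.shape[0]
    M = np.linalg.solve(Kw + reg * np.eye(n), Kb)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:n_dirs]
    return vecs[:, order].real


# Synthetic example: three classes (batches) of five samples each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((5, 2)) for c in centers])
labels = np.repeat(np.arange(3), 5)

Kb, Kw = kernel_scatter(X, labels)
alphas = discriminant_vectors(Kb, Kw)   # columns: optimal and suboptimal vectors
```

The first column of `alphas` maximizes the regularized ratio of between-class to within-class scatter, and the second column plays the role of the suboptimal discriminant vector.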

The eigenvalue vectors of all batches at time $k$ in the first two projection
directions are $[T_{1,k}^1, T_{2,k}^1, \ldots, T_{I,k}^1]$ and
$[T_{1,k}^2, T_{2,k}^2, \ldots, T_{I,k}^2]$. The means of the two eigenvalue vectors
are $\mathrm{mean}_1(k)$ and $\mathrm{mean}_2(k)$, respectively. Define

$$\begin{aligned}
\mathrm{max}_1(k) &= \max\left[ |T_{1,k}^1 - \mathrm{mean}_1(k)|, \ldots,
                     |T_{I,k}^1 - \mathrm{mean}_1(k)| \right], \\
\mathrm{max}_2(k) &= \max\left[ |T_{1,k}^2 - \mathrm{mean}_2(k)|, \ldots,
                     |T_{I,k}^2 - \mathrm{mean}_2(k)| \right],
\end{aligned} \tag{7.10}$$

and let $\mathrm{max}(k)$ be the larger of $\mathrm{max}_1(k)$ and
$\mathrm{max}_2(k)$ for all $k = 1, 2, \ldots, K$. Then the envelope surface in the
high-dimensional space is

$$(x_k - \mathrm{mean}_1(k))^2 + (y_k - \mathrm{mean}_2(k))^2 = \mathrm{max}(k)^2,
\quad k = 1, 2, \ldots, K, \tag{7.11}$$

where $(x_k, y_k)$ is the projection of the original data in the feature space, i.e.,
$x_k$ is the eigenvalue in the optimal discriminant direction and $y_k$ is the
eigenvalue in the suboptimal discriminant direction. Equation (7.11) gives the
envelope surface with the maximum variation allowed for the eigenvalues of this
kind of data at different sampling times.


Unequal Cycle Discussion 


Suppose the production period of each batch is different, i.e., $n_i$ varies with
the batch $i$. The envelope surface model is built as described above, but the
difference lies in the composition of the eigenvalue vectors. As a simple example,
suppose there are $I$ batches of data in a training data set, and the sampling
moment $k$ for each batch varies from 1 to $K$, where $K$ is the largest sampling
moment of all batches. Suppose only batch $i$ does not reach the maximum sampling
moment $K$, i.e., $k = 1, \ldots, n_i$, $n_i < K$. The corresponding eigenvalue vectors
are $[T_{1,k}^1, T_{2,k}^1, \ldots, T_{I,k}^1]$ and
$[T_{1,k}^2, T_{2,k}^2, \ldots, T_{I,k}^2]$ for $k = 1, \ldots, n_i$. When the time
increases to $k = n_i + 1, \ldots, K$, the eigenvalue vectors are
$[T_{1,k}^1, T_{2,k}^1, \ldots, T_{i-1,k}^1, T_{i+1,k}^1, \ldots, T_{I,k}^1]$ and
$[T_{1,k}^2, T_{2,k}^2, \ldots, T_{i-1,k}^2, T_{i+1,k}^2, \ldots, T_{I,k}^2]$.
Obviously, the parameters $\mathrm{max}(k)$, $\mathrm{max}_1(k)$, and
$\mathrm{max}_2(k)$ in the envelope surface model (7.11) vary with time $k$.
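
The handling of unequal cycles can be sketched as below: at each sampling time only the batches that are still running contribute to the means and to the envelope radius. The function name and the toy scores are illustrative assumptions. The input is taken to be the two projected values of every batch at every sampling time.

```python
import numpy as np


def envelope_parameters(scores):
    """Envelope parameters of Eqs. (7.10)-(7.11) for batches of unequal length.

    scores: list with one array per batch, shape (n_i, 2), holding the projected
    values in the optimal and suboptimal directions at each sampling time.
    Returns mean1, mean2, and max_r, each of length K = max(n_i).
    """
    K = max(s.shape[0] for s in scores)
    mean1, mean2, max_r = np.zeros(K), np.zeros(K), np.zeros(K)
    for k in range(K):
        # Only batches that have a sample at time k are used.
        active = np.array([s[k] for s in scores if s.shape[0] > k])
        m = active.mean(axis=0)
        dev = np.abs(active - m)      # |T - mean| in both directions
        mean1[k], mean2[k] = m
        max_r[k] = dev.max()          # larger of max_1(k) and max_2(k)
    return mean1, mean2, max_r


# Toy example: batch A runs for 3 samples, batch B stops after 2.
batch_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
batch_b = np.array([[3.0, 2.0], [3.0, 2.0]])
mean1, mean2, max_r = envelope_parameters([batch_a, batch_b])
```

With a single remaining batch the radius collapses to zero, which shows why enough training batches must reach the later sampling times.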


7.1.2 Detection Indicator 


Define the detection indicators as follows:

$$P_1(k) = \frac{|T_k^1 - \mathrm{mean}_1(k)|}{\mathrm{max}(k)}, \quad
P_2(k) = \frac{|T_k^2 - \mathrm{mean}_2(k)|}{\mathrm{max}(k)}, \quad
T(k) = (T_k^1)^2 + (T_k^2)^2, \tag{7.12}$$

where $T_k^1$ and $T_k^2$ are the eigenvalues obtained by mapping the real-time
sampling vector $x_k$ onto the discriminant vectors in the high-dimensional space.
When the trajectory of the eigenvalues at that moment is contained within the
envelope surface, $P_1(k) < 1$ and $P_2(k) < 1$ must hold. If the new batch data
differ greatly from the training data of this envelope surface model, the Gaussian
kernel function used in the kernel Fisher criterion is almost zero, so that
$T_k^1 = 0$ and $T_k^2 = 0$, i.e., $T(k) = 0$. Thus, for given measured data, a
judgement can be made using the above indicators. When $P_1(k) < 1$, $P_2(k) < 1$,
and $T(k) = 0$ does not occur, the data sampled at that moment belong to this mode
type. When $T(k) = 0$ occurs consistently, the sampled data do not belong to this
mode type.

Assume that the normal operating envelope surface model has determined that the
batch is faulty at some point. Fault identification is then carried out using the fault
envelope surface models. For a given fault envelope surface model, if $P_1(k) < 1$,
$P_2(k) < 1$, and $T(k) \neq 0$, then the batch fault belongs to that fault type. If
$T(k) = 0$ appears consistently in every envelope model, the fault may be a new one.
When that fault occurs multiple times, the pattern types need to be updated and an
additional envelope model needs to be constructed for the new fault.
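
The indicators in (7.12) and the membership test built on them can be written as a short sketch. The tolerance `zero_tol` is an assumption introduced here because an exact comparison with zero is fragile in floating point. The envelope parameters in the example are hypothetical, and only the two projected values are taken from the first column of Table 7.2.

```python
def detection_indicators(t1, t2, mean1_k, mean2_k, max_k):
    """Indicators P1(k), P2(k), and T(k) of Eq. (7.12) for one sampling time."""
    p1 = abs(t1 - mean1_k) / max_k
    p2 = abs(t2 - mean2_k) / max_k
    t = t1 ** 2 + t2 ** 2
    return p1, p2, t


def matches_model(t1, t2, mean1_k, mean2_k, max_k, zero_tol=1e-8):
    """True if the sample lies inside the envelope and T(k) is not (nearly) zero."""
    p1, p2, t = detection_indicators(t1, t2, mean1_k, mean2_k, max_k)
    return p1 < 1.0 and p2 < 1.0 and t > zero_tol


# Projected values 0.044 and 0.159 with hypothetical envelope parameters.
inside = matches_model(0.044, 0.159, mean1_k=0.0, mean2_k=0.1, max_k=0.2)
# A sample whose kernel values vanish projects to (0, 0).
vanished = matches_model(0.0, 0.0, mean1_k=0.0, mean2_k=0.1, max_k=0.2)
```

Here `inside` is true and `vanished` is false, which mirrors the two cases distinguished in the text.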

The fault identification procedure using the proposed kernel Fisher envelope surface
(KFES) analysis is given as follows. Its fault monitoring flowchart is shown in
Fig. 7.1.


Fault Monitoring Algorithm Based on KFES
Step 1: Collect the historical data with $S$ fault categories. Construct one envelope
surface model for each category $s = 1, \ldots, S$ based on the description in
Sect. 7.1.1:

$$(x_k - \mathrm{mean}_1^s(k))^2 + (y_k - \mathrm{mean}_2^s(k))^2 = \mathrm{max}^s(k)^2,
\quad k = 1, 2, \ldots, K. \tag{7.13}$$

Then store all the model parameters $\mathrm{mean}_1^s(k)$, $\mathrm{mean}_2^s(k)$,
and $\mathrm{max}^s(k)$, $k = 1, 2, \ldots, K$. Thus, the envelope model library
Env-model$(S, k)$ is constructed.

Step 2: Sample the real-time data $x_k$. After normalization, the kernel sampling
vector $\xi_k$ is obtained.

Step 3: For each of the $S$ known fault envelope surface models at time $k$, project
the kernel sampling vector $\xi_k$ of $x_k$ onto the discriminant vectors. Calculate
the corresponding projected eigenvalues $T_k^1$, $T_k^2$ and the detection indicators.
If $P_1^s(k) < 1$, $P_2^s(k) < 1$, and $T^s(k) \neq 0$, then the fault belongs to
category $s$.

Step 4: If the conditions in Step 3 are not satisfied for any known fault type, a new
fault may have occurred. When that unknown fault lasts for a period of time, the
model library needs to be updated. The envelope surface for this new fault is modeled
from the accumulated new batch data as in Step 1 and added to the model library.
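
Steps 3 and 4 above can be sketched as a loop over a model library. Everything named here is a hypothetical illustration: the dictionary layout, the `project` callables that stand in for the kernel projection of each model, the tolerance `zero_tol`, and the run length `min_run` used to decide that an unknown fault has lasted long enough.

```python
def identify(x_k, k, library, zero_tol=1e-8):
    """Return the categories whose envelope contains sample x_k at time k."""
    hits = []
    for name, model in library.items():
        t1, t2 = model["project"](x_k)          # projected eigenvalues
        p1 = abs(t1 - model["mean1"][k]) / model["max"][k]
        p2 = abs(t2 - model["mean2"][k]) / model["max"][k]
        if p1 < 1.0 and p2 < 1.0 and t1 ** 2 + t2 ** 2 > zero_tol:
            hits.append(name)
    return hits


def is_new_fault(history, min_run=3):
    """True if the last `min_run` identification results were all empty."""
    recent = history[-min_run:]
    return len(recent) == min_run and all(len(h) == 0 for h in recent)


# Toy library: the projections are stand-ins, not trained kernel models.
library = {
    "fault 1": {"project": lambda x: (x[0], x[1]),
                "mean1": [0.0], "mean2": [0.0], "max": [1.0]},
    "fault 2": {"project": lambda x: (0.0, 0.0),
                "mean1": [0.0], "mean2": [0.0], "max": [1.0]},
}

known = identify((0.3, -0.2), 0, library)
unknown = identify((5.0, 0.0), 0, library)
```

An empty result at one sampling time only suggests a new fault. The library is extended once `is_new_fault` reports that the condition has persisted.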


7.1.3 KFES-PCA-Based Synthetic Diagnosis in Batch 
Process 


The basic idea of synthetic diagnosis is to combine the advantages of KFES and PCA.
A multiway PCA model is built for the normal operating data in the historical
database, and the monitoring statistics $T^2$ and SPE of the PCA model and their
control limits are calculated. The multiway PCA model is used for fault detection.
For the fault data in the historical database, KFES models are built for the known
fault categories, and the KFES analysis is used for fault identification. The modeling
and online monitoring process of synthetic diagnosis is shown in Fig. 7.2.

The normal operating data and $S$ classes of fault data are obtained from the
historical data set. First, the normal operating data $X(I \times J \times K)$ are
unfolded into a two-dimensional matrix $X(I \times JK)$ in the time direction. After
normalization, the data are unfolded again as $Y(IK \times J)$ in the batch direction.
Multiway PCA is performed on this matrix to obtain the score matrix
$T(IK \times R)$ and the loading matrix $P(J \times R)$, where $R$ is the number of
principal components. Then the control limits of the statistics $T^2$ and SPE are
calculated.

Fig. 7.2 Process monitoring flowchart based on KFES-PCA

Instead of contribution plots, kernel Fisher envelope surface analysis is used for
fault diagnosis. Assume that there are $S$ classes in the fault data set. The envelope
surface model is first constructed for each fault type. When the new data
$x_{new,k}$ are obtained, the PCA model is used to judge whether the current operation
is normal. If $T^2$ and SPE exceed the control limits, a fault is detected, and the
type of fault is then identified with the KFES model library. If the eigenvalues do
not satisfy the indicators in any of the known fault models, the fault is considered
to be new. Once enough data for KFES modeling have been collected, the new fault
model is added to the model library.
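
The two unfolding steps and the PCA modeling described in this section can be sketched with NumPy. This is a simplified illustration that assumes equal-length batches stored as an array of shape (I, J, K), and it uses a plain SVD-based PCA instead of the improved multiway PCA of the chapter. Control limits are not computed here.

```python
import numpy as np


def unfold_and_normalize(X):
    """Unfold X (I batches x J variables x K times) and normalize it.

    Returns Y with shape (I*K, J) and the mean and standard deviation of every
    (time, variable) pair, each with shape (K, J).
    """
    I, J, K = X.shape
    X2 = X.transpose(0, 2, 1).reshape(I, K * J)   # I x KJ, one row per batch
    mean = X2.mean(axis=0)
    std = X2.std(axis=0, ddof=1)
    std[std == 0] = 1.0                           # guard against constant columns
    Z = (X2 - mean) / std
    Y = Z.reshape(I * K, J)                       # IK x J, one row per sample
    return Y, mean.reshape(K, J), std.reshape(K, J)


def fit_pca(Y, R):
    """Loading matrix P (J x R), score matrix T (IK x R), and score variances."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    P = Vt[:R].T
    T = Y @ P
    lam = s[:R] ** 2 / (Y.shape[0] - 1)
    return P, T, lam


# Synthetic normal-operation data: 6 batches, 3 variables, 5 sampling times.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3, 5))
Y, mean, std = unfold_and_normalize(X)
P, T, lam = fit_pca(Y, R=2)
```

Row `i * K + k` of `Y` holds the normalized variables of batch `i` at time `k`, so a new sample at time `k` can later be normalized with `mean[k]` and `std[k]`.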


Process Monitoring Algorithm Based on KFES-PCA
A. Offline Modeling

Step 1: Develop an improved multiway PCA model for the normal operating data,
calculate the statistics $T^2$ and SPE, and determine the corresponding control
limits $T^2_{\lim}$ and $\mathrm{SPE}_{\lim}$ based on the score matrix
$T(KI \times R)$ and loading matrix $P(J \times R)$ obtained from the normal model.

Step 2: Apply KFES analysis to the fault data and construct a fault envelope for
each type of fault separately. Find the optimal discriminant weight matrix
$W_\alpha$, the means $\mathrm{mean}_1(k)$ and $\mathrm{mean}_2(k)$, and the maximum
$\mathrm{max}(k)$ of the eigenvalue vectors.

Step 3: Store $T^2_{\lim}$ and $\mathrm{SPE}_{\lim}$, the discriminant weight matrix
$W_\alpha$ for each fault type, the means $\mathrm{mean}_1(k)$ and
$\mathrm{mean}_2(k)$, and the maximum $\mathrm{max}(k)$ of the eigenvalues.


B. Online Monitoring

Step 1: Normalize the new batch data $x_{new,k}(J \times 1)$ at the $k$th sampling
moment.

Step 2: Calculate the statistics $T^2$ and SPE and determine whether they exceed the
control limits. If not, go back to Step 1. Otherwise, proceed to the next step.

Step 3: Use the known fault envelope surface models for fault identification at that
moment. The sample $x_{new,k}(J \times 1)$ obtained at the $k$th sampling moment is
normalized and projected onto the discriminant weight matrix $W_\alpha$ of the kernel
Fisher envelope model to obtain the eigenvalues $T_k^1$ and $T_k^2$. The eigenvalues
are substituted into the indicators. If $P_1(k) < 1$, $P_2(k) < 1$, and
$T(k) \neq 0$, the fault belongs to this fault type.

Step 4: If a fault has been detected in Step 2 but does not belong to any known
fault type in Step 3, a new fault may have occurred. When that unknown fault has
occurred several times, the mode types need to be updated, and an envelope surface
model for that fault is built offline from the accumulated batches of the new fault
and added to the library.
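
The detection part of the online procedure (Steps 1 and 2) can be sketched as follows. The formulas are the usual PCA definitions of Hotelling's $T^2$ and SPE. The loading matrix, score variances, and control limits in the example are made-up numbers, and raising an alarm when either statistic exceeds its limit is an assumption of this sketch.

```python
import numpy as np


def t2_spe(x_norm, P, lam):
    """Hotelling T^2 and SPE of one normalized sample x_norm (length J)."""
    t = P.T @ x_norm                      # scores in the latent space
    t2 = float(np.sum(t ** 2 / lam))
    residual = x_norm - P @ t             # part not explained by the PCA model
    spe = float(residual @ residual)
    return t2, spe


def online_step(x_raw, mean_k, std_k, P, lam, t2_lim, spe_lim):
    """Normalize a sample at time k and compare both statistics with the limits."""
    x = (np.asarray(x_raw, dtype=float) - mean_k) / std_k
    t2, spe = t2_spe(x, P, lam)
    alarm = (t2 > t2_lim) or (spe > spe_lim)   # assumed alarm rule
    return alarm, t2, spe


# Made-up model: two principal components in a three-variable space.
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
lam = np.array([4.0, 1.0])
alarm, t2, spe = online_step([2.0, 1.0, 3.0], mean_k=np.zeros(3), std_k=np.ones(3),
                             P=P, lam=lam, t2_lim=5.0, spe_lim=5.0)
```

Only when `alarm` is true does the procedure continue with the envelope-based identification of Step 3.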
7.2 Simulation Experiment Based on KFES-PCA


The fed-batch penicillin fermentation simulation platform is used here to verify the
effectiveness of the KFES-PCA method for fault diagnosis. Eleven variables affecting
the fermentation reaction were selected for modeling, including air flow, stirring
power, substrate feed rate, and temperature. Three simulated fault types were
selected, as shown in Table 7.1. The data set of 50 batches was generated from the
Pensim 2.0 simulation platform with a 1 h sampling interval, consisting of 20 batches
of normal operation, 10 batches of substrate feed rate drop failure, 10 batches of
agitation power drop failure, and 10 batches of air flow drop failure. The normal
operation data were obtained with different production cycles: one batch of 95 h,
two batches of 96 h, two batches of 97 h, three batches of 98 h, five batches of
99 h, and seven batches of 100 h. The fault batch data were collected similarly, by
varying the reaction duration of each batch and the time and amplitude of the fault.

Figure 7.3a-d shows the envelope surfaces of the kernel Fisher discriminant envelope
models trained offline for the normal operation and the three known fault operations,
respectively. Here the x-axis and y-axis represent the directions of the optimal and
suboptimal discriminant vectors, and the z-axis represents time.

Traditional monitoring methods, such as MPCA and MFDA, require the modeling batches
to be of equal length. However, the duration of different batches tends to vary in
practice. Therefore, the data of different batches must be preprocessed to equal
length when these methods are used. The proposed KFES-PCA method unfolds the data in
the batch direction during preprocessing, which copes with batches of unequal length
in a simple way and is therefore easy to apply in practice.

The following experiments are designed to perform online detection with known fault
data and new unknown fault data, respectively. The two batches of test data are not
included in the training data so that the validation is valid. In addition, a
comparative validation using the conventional contribution plot method and the
improved MFDA method (Jiang et al. 2003) is also carried out.


Table 7.1 Types of faults in penicillin fermentation processes

Fault number | Fault type
1            | Base flow rate down (step)
2            | Agitator power down (step)
3            | Air flow down (step)

Fig. 7.3 Envelope surface for normal and three fault operations


7.2.1 Diagnostic Effect on Existing Fault Types 


Experiment 1: Step Drop Fault at Stirring Power
A fault batch is regenerated for testing with the stirring power drop fault. The
fault occurs at 50 h as a step disturbance of -12% in magnitude and lasts until the
process ends. The sampled data are first monitored with the $T^2$ and SPE statistics,
as shown in Fig. 7.4. It can be seen that $T^2$ and SPE continue to exceed the limits
from 50 h to the end of the process, so the failure is detected when it occurs at
50 h. Table 7.2 records the indicators when the batch is diagnosed using the envelope
surface model of fault 2. It shows that $P_1(k) < 1$, $P_2(k) < 1$, and
$T(k) \neq 0$ hold throughout the period from 50 h to 100 h, so it is concluded that
the fault of the test batch belongs to fault 2. Figure 7.5 shows the diagnosis results
based on each envelope surface model. It can also be seen that the fault matches the
second type of fault, a stirring power drop fault. The contribution plot is used to
analyze the test data at 50 h, as shown in Fig. 7.6. The second variable contributes
significantly to both the $T^2$ and SPE statistics, which also indicates that the
fault belongs to fault 2. Therefore, the envelope surface model is as successful as
the contribution plot method in diagnosing this fault type.

Fig. 7.4 Monitoring statistics of KFES-PCA method: experiment 1


Table 7.2 The indicators detected in fault 2 envelope surface: experiment 1

k        | 50    | 51     | 52     | 53     | 54     | 55     | 56     | 57     | ... | 100
$T_k^1$  | 0.044 | 0.025  | 0.028  | -0.011 | 0.032  | 0.062  | 0.110  | 0.083  | ... | -0.005
$T_k^2$  | 0.159 | -0.145 | -0.233 | -0.141 | -0.173 | -0.205 | -0.271 | -0.202 | ... | -0.241
$P_1(k)$ | <1    | <1     | <1     | <1     | <1     | <1     | <1     | <1     | ... | <1
$P_2(k)$ | <1    | <1     | <1     | <1     | <1     | <1     | <1     | <1     | ... | <1


A comparison experiment is carried out with the improved MFDA method, as shown in
Fig. 7.7. The horizontal coordinate is time. The vertical coordinate is the fault
type, where 0 represents normal operation, and 1, 2, 3, and 4 correspond to fault 1,
fault 2, fault 3, and an unknown fault, respectively. It can be seen that the improved
MFDA has a relatively high rate of misdiagnosis, and its diagnosis result is not
ideal.


Experiment 2: Step Drop Fault at Air Flow

The test fault is an air flow drop failure. The test data are regenerated with the
fault occurring at 58 h as a step disturbance of -10% that lasts until the process
ends. The monitoring statistics $T^2$ and SPE are given in Fig. 7.8. $T^2$ and SPE
continue to exceed the control limits from 58 h to the end, so the fault is detected
at 58 h in real time.

Table 7.3 records the indicators obtained with the envelope surface model of fault 3.
It can be seen that $P_1(k) < 1$, $P_2(k) < 1$, and $T(k) \neq 0$ hold between 58 h
and 100 h, so the fault in this test batch is judged to belong to fault 3. Figure 7.9
shows the diagnosis results with the different envelope surface models. It can also
be seen that this fault matches the model of fault 3, i.e., the air flow drop fault.

Fig. 7.5 Fault diagnosis based on envelope surfaces: experiment 1

Fig. 7.6 Contribution plot to statistics $T^2$ and SPE at 50 h


The contribution plot of the sampled data at 58 h is shown in Fig. 7.10, where
variables 1, 4, 6, and 8 contribute most to the statistic $T^2$ and variable 3
contributes most to the statistic SPE. The diagnosis result is therefore
inconclusive. In this case, the envelope surface method successfully diagnoses a
fault that the contribution plot cannot diagnose.

Fig. 7.8 Monitoring statistics of KFES-PCA method: experiment 2


The comparison results of the improved MFDA method are given in Fig. 7.11. The
method shows a relatively high rate of misdiagnosis, and its diagnosis result is less
satisfactory than that of the proposed KFES-PCA.

Fig. 7.9 Fault diagnosis based on envelope surfaces: experiment 2


Table 7.3 The indicators detected in fault 3 envelope surface: experiment 2

k        | 58    | 59     | 60     | 61     | 62     | 63     | 64     | 65     | ... | 100
$T_k^1$  | 0.110 | -0.110 | -0.171 | -0.133 | -0.220 | -0.182 | -0.100 | -0.054 | ... | -0.066
$T_k^2$  | 0.237 | -0.162 | -0.259 | -0.141 | -0.393 | -0.378 | -0.273 | -0.332 | ... | -0.295
$P_1(k)$ | <1    | <1     | <1     | <1     | <1     | <1     | <1     | <1     | ... | <1
$P_2(k)$ | <1    | <1     | <1     | <1     | <1     | <1     | <1     | <1     | ... | <1

Fig. 7.10 Contribution plot to statistics $T^2$ and SPE at 58 h: experiment 2

Fig. 7.11 Fault diagnosis based on improved MFDA: experiment 2


7.2.2 Diagnostic Effect on Unknown Fault Types 


Experiment 3: Slope Drop Fault at Air Flow Rate

Here a new fault is used to test the diagnosis ability of the proposed KFES-PCA
method. A slope (ramp) fault, which differs from the three known fault types, is
considered. The test fault is a ramp fault of -15% in the air flow rate starting at
50 h. First, the $T^2$ and SPE statistics are used to detect this new fault.
Figure 7.12 shows that the $T^2$ and SPE statistics both detect this fault in time at
50 h.

Fig. 7.12 Monitoring statistics of KFES-PCA: experiment 3

Table 7.4 The indicator detected in fault 3 envelope surface: experiment 3

Fig. 7.13 Fault diagnosis based on different envelope surfaces: experiment 3


The known envelope surface models are used to diagnose this fault. Table 7.4 records
that all the indicators are zero when the envelope surface model of fault 3 is used
for diagnosis, which means that the fault is not fault 3. The same indicator results
are obtained from the envelope surface models of the other known faults. Figure 7.13
gives the diagnosis results under the different envelope surface models. This fault
therefore does not belong to any known fault category and is diagnosed as a new
fault. The proposed method thus realizes real-time diagnosis for unknown faults.

The diagnosis result of the improved MFDA method is given in Fig. 7.14. It can be
seen that the improved MFDA does not make a timely and correct diagnosis when the
fault occurs. It first gives a wrong diagnosis result, fault type 3. The correct
result is not reported until 63 h, when the fault is diagnosed as a new fault, so
there is a 13 h delay. Therefore, the improved MFDA method fails to identify the new
fault in time.

Fig. 7.14 Fault diagnosis based on improved MFDA: experiment 3


7.3 Conclusions 


This chapter describes a monitoring method based on KFES-PCA for batch processes.
The production cycles of batch processes are often unequal, whereas monitoring
methods for batch processes generally require batch data with consistent production
cycles. Although data preprocessing can produce equal cycles, it can result in the
loss of important information about faults. In addition, many existing monitoring
methods require a complete production trajectory for online monitoring, and filling
in or estimating unknown values inevitably decreases the diagnostic performance. To
address these two problems, the modeling process of the KFES method is described in
detail and an online monitoring flowchart is presented. Furthermore, a batch fault
diagnosis method integrating KFES and the improved PCA method is proposed. The
method is applied to a penicillin fermentation simulation platform and compared with
the traditional contribution plot method and the improved MFDA method. The results
show that the proposed method has better monitoring performance, diagnoses faults
early and effectively, and is able to identify unknown faults.


Chapter 8
Fault Identification Based on Local
Feature Correlation


Industrial process data are high dimensional and strongly nonlinearly correlated.
Traditional multivariate statistical monitoring methods, such as PCA, PLS, CCA, and
FDA, are only suitable for high-dimensional data with linear correlation. Kernel
mapping is the most common technique for dealing with nonlinearity. It projects the
original data from the low-dimensional space to a high-dimensional space through
appropriate kernel functions so that the data become linearly separable in the new
space. However, projecting from a low dimension to a high dimension contradicts the
practical requirement of dimensionality reduction, so kernel-based methods inevitably
increase the complexity of data processing. For this reason, we have proposed another
kind of nonlinear processing approach based on manifold learning, a class of
unsupervised models that describe data sets as low-dimensional manifolds embedded in
high-dimensional spaces. It characterizes the original data as a low-dimensional
manifold in order to handle the nonlinear correlation. This strategy is consistent
with the goal of dimensionality reduction. Furthermore, manifold learning fits the
nonlinear correlation by means of piecewise linearization in an intuitive sense. It
has significantly less complexity than the kernel mapping method.

This chapter develops pattern classification techniques for multiple variables with
strong nonlinear correlation and applies them to the fault identification of batch
processes. Two kinds of pattern classification methods are given in this chapter.
(1) Kernel exponential discriminant analysis (KEDA): this method addresses the
nonlinear correlation among the variables at two levels, kernel mapping and
exponential discrimination. It can significantly improve the classification accuracy
compared with the traditional FDA method. (2) Fusion methods based on manifold
learning and discriminant analysis: two different fusion strategies, local linear
exponential discriminant analysis (LLEDA) and neighborhood-preserving embedding
discriminant analysis (NPEDA), are given. Here locally linear embedding (LLE) is a
popular manifold learning algorithm. Both strategies combine the advantage of global
discriminant analysis with the preservation of local structure. LLEDA is a parallel
strategy that finds a trade-off projection vector between preserving the local
geometric structure and classifying the global data. NPEDA is a cascaded strategy
whose dimensionality reduction is implemented in two serial steps. The two methods
emphasize the intrinsic structure of the data while using the global discriminant
information, so they classify better than the traditional EDA method. Finally, a
hybrid fault diagnosis scheme is given for complex industrial processes, which
consists of PCA-based fault detection, hierarchical clustering-based pre-diagnosis,
and LLEDA-based final identification.


8.1 Fault Identification Based on Kernel Discriminant 
Exponent Analysis 


8.1.1 Methodology of KEDA 


Kernel exponential discriminant analysis (KEDA) is also a discriminative
classification method. It aims to find a series of discriminant vectors that
transform the data into the kernel space and achieve the greatest separation between
different types of data in the projection direction.

Consider the batch process data set with $I$ batches, i.e.,

$$X(k) = [X^1(k), X^2(k), \ldots, X^I(k)]^T,$$

where $X^i$ consists of $n_i$, $i = 1, \ldots, I$, row vectors, and each row vector is
a sample vector $x_j^i(k)$, $j = 1, \ldots, n_i$, acquired at time $k$ in batch $i$.
According to the analysis in equations (7.1)-(7.9) in Sect. 7.1.1, the optimization
function of kernel Fisher discriminant analysis (KFDA) is given as follows,


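$$\max J(\alpha) = \frac{\mathrm{tr}(\alpha^T K_b \alpha)}{\mathrm{tr}(\alpha^T K_w \alpha)}
= \frac{\mathrm{tr}(\alpha^T (V_b \Lambda_b V_b^T) \alpha)}
       {\mathrm{tr}(\alpha^T (V_w \Lambda_w V_w^T) \alpha)}, \tag{8.1}$$

where $K_b = V_b \Lambda_b V_b^T$ and $K_w = V_w \Lambda_w V_w^T$ are the eigenvalue
decompositions of the between-class and within-class scatter matrices, respectively.
$\Lambda_b = \mathrm{diag}(\lambda_{b1}, \lambda_{b2}, \ldots, \lambda_{bn})$ and
$\Lambda_w = \mathrm{diag}(\lambda_{w1}, \lambda_{w2}, \ldots, \lambda_{wn})$ are the
eigenvalues, and $V_b = (v_{b1}, v_{b2}, \ldots, v_{bn})$ and
$V_w = (v_{w1}, v_{w2}, \ldots, v_{wn})$ are the corresponding eigenvectors. The basic
objective is to maximize the between-class distance and minimize the within-class
distance simultaneously during the projection.

In order to improve the discrimination accuracy further, the discriminant function
(8.1) is exponentiated:

$$\max J(\alpha)
= \frac{\mathrm{tr}(\alpha^T (V_b \exp(\Lambda_b) V_b^T) \alpha)}
       {\mathrm{tr}(\alpha^T (V_w \exp(\Lambda_w) V_w^T) \alpha)}
= \frac{\mathrm{tr}(\alpha^T \exp(K_b) \alpha)}
       {\mathrm{tr}(\alpha^T \exp(K_w) \alpha)}. \tag{8.2}$$

The optimization problem (8.2) is transformed into the following generalized
eigenvalue problem:

$$\exp(K_b) \alpha = \Lambda \exp(K_w) \alpha
\quad \text{or} \quad
\exp(K_w)^{-1} \exp(K_b) \alpha = \Lambda \alpha, \tag{8.3}$$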


where A is the eigenvalue and œ is the corresponding eigenvector. The discrimi- 
nant vectors are calculated from (8.3). Usually, the first two vectors, optimal, and 
suboptimal ones are selected for dimensionality reduction. 
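
As a rough numerical illustration of (8.2) and (8.3), the following sketch exponentiates two symmetric scatter matrices through their eigenvalue decompositions and solves the generalized eigenvalue problem by Cholesky whitening. The function names `sym_expm` and `keda_directions` and the toy matrices are assumptions made for this example only; they are not part of the original method description.

```python
import numpy as np

def sym_expm(S):
    """exp(S) for a symmetric matrix S, computed as V exp(Lambda) V^T."""
    lam, V = np.linalg.eigh(S)
    return (V * np.exp(lam)) @ V.T

def keda_directions(K_b, K_w, d=2):
    """Leading d solutions of exp(K_b) a = lam exp(K_w) a, as in (8.3)."""
    E_b, E_w = sym_expm(K_b), sym_expm(K_w)
    # exp(K_w) is symmetric positive definite, so a Cholesky factor L exists.
    L = np.linalg.cholesky(E_w)
    L_inv = np.linalg.inv(L)
    # Equivalent symmetric problem: (L^-1 E_b L^-T) u = lam u, with a = L^-T u.
    lam, U = np.linalg.eigh(L_inv @ E_b @ L_inv.T)
    order = np.argsort(lam)[::-1][:d]
    return lam[order], L_inv.T @ U[:, order]

# Toy scatter matrices (hypothetical data); K_w is rank-deficient on purpose.
rng = np.random.default_rng(0)
B = 0.5 * rng.standard_normal((5, 2))
W = 0.5 * rng.standard_normal((5, 3))
K_b, K_w = B @ B.T, W @ W.T
lam, alpha = keda_directions(K_b, K_w, d=2)
```

The two columns of `alpha` play the role of the optimal and suboptimal discriminant vectors. Whitening with the Cholesky factor is only one possible way to solve (8.3); it is chosen here because it keeps the problem symmetric.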

Both the within-class and between-class scatter matrices are exponentiated in KEDA. 
Since $e^x > x$ for any $x \ge 0$, every eigenvalue of an exponentiated scatter matrix is 
larger than the corresponding eigenvalue in KFDA, so the scatter in KEDA is greater than 
in KFDA. This suggests that KEDA has better discriminatory capability than KFDA. 
Moreover, if the number of samples is less than the number of variables, the rank of 
the within-class scatter matrix is less than the dimension of the variables. The 
within-class scatter matrix is then singular and its inverse does not exist. In KEDA, 
however, the exponentiated matrices are always full rank, because each of their 
eigenvalues $e^{\lambda}$ is strictly positive, so the singularity problem caused by small 
samples is avoided. From this viewpoint, the KEDA method not only solves the 
small-sample problem but also separates the sample data into different categories more 
effectively, which helps to improve the classification accuracy. 
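
The small-sample argument can be checked numerically. The sketch below builds a within-class scatter matrix from fewer samples than variables (hypothetical random data) and compares its rank with the rank of its matrix exponential; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = 0.5 * rng.standard_normal((3, 6))     # 3 samples (rows), 6 variables
Xc = X - X.mean(axis=0)                   # center the samples
S_w = Xc.T @ Xc                           # 6 x 6 scatter matrix, rank at most 2

lam, V = np.linalg.eigh(S_w)
exp_S_w = (V * np.exp(lam)) @ V.T         # exp(S_w) = V exp(Lambda) V^T

rank_S = np.linalg.matrix_rank(S_w)       # deficient: fewer samples than variables
rank_exp = np.linalg.matrix_rank(exp_S_w) # full: all eigenvalues exp(lam) are positive
```

The zero eigenvalues of `S_w` are mapped to $e^0 = 1$, which is why `exp_S_w` can be inverted even though `S_w` cannot.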

Consider the nonlinear mapping $\Phi(x_j^i)$ of the original sample $x_j^i$, and project it onto 
the optimal and suboptimal discriminant directions, respectively. Then the eigenvalues 
$T_i(k) = [T_i^1, T_i^2]$ are obtained, which represent the projection values 
in the optimal and suboptimal discriminant directions. Usually, data in the same 
class show similar projection eigenvalues in the direction of the selected discriminant 
vectors. If the test data match a known fault class, they have the maximum projection 
eigenvalue under this model, clearly nonzero. If the test data do not match 
this class, the eigenvalue is small, even close to zero. However, it is unreliable to judge 
the data type simply from the magnitude of the eigenvalues. So the difference degree $D$ 
between two projection values $T_i(k)$ and $T_j(k)$ is defined as follows: 


$$D_{i,j}(k) = 1 - \frac{T_i(k)^T T_j(k)}{\|T_i(k)\|_2 \, \|T_j(k)\|_2}. \tag{8.4}$$


The smaller the difference degree $D$, the better the data match the model. 
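
Reading (8.4) as one minus the cosine similarity of the two projection vectors, the difference degree can be sketched in a few lines. The function name `difference_degree` is hypothetical, and returning `None` for a zero denominator is a choice made for this example.

```python
import math

def difference_degree(t_i, t_j):
    """D = 1 - <t_i, t_j> / (||t_i||_2 ||t_j||_2); None if a norm is zero."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm_i = math.sqrt(sum(a * a for a in t_i))
    norm_j = math.sqrt(sum(b * b for b in t_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return None  # zero denominator in (8.4): the degree is undefined
    return 1.0 - dot / (norm_i * norm_j)

same_direction = difference_degree([1.0, 0.0], [2.0, 0.0])
orthogonal = difference_degree([1.0, 0.0], [0.0, 1.0])
undefined = difference_degree([0.0, 0.0], [1.0, 2.0])
```

Under this reading $D$ ranges from 0 (projections point in the same direction) to 2 (opposite directions), and it is undefined when one of the projections is identically zero.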

The KEDA-based fault classification and identification process for batch processes 
is given as follows: 


Step 1: Data preprocessing. The three-dimensional data set $X(L \times J \times K)$ is 
batch-wise unfolded into two-dimensional data $X(LK \times J)$, normalized along the time 
in the batch cycle and variable-wise re-arranged. 

Step 2: Kernel projection. The original data $X$ are mapped to a high-dimensional 
feature space via a nonlinear kernel function, and the kernel sampling data 
$\xi_j^i = [K(x_1, x_j^i), K(x_2, x_j^i), \ldots, K(x_n, x_j^i)]^T$ are obtained. 

Step 3: KEDA modeling. The optimal kernel discriminant vectors are solved from 
the generalized eigenvalue problem (8.3). Project the sample data $\xi_j^i$ onto the selected 
kernel discriminant vectors and calculate the corresponding eigenvalues $T_i(k)$. 

Step 4: Test calculation. The test sample $x_{j,\mathrm{new}}(k)$ is collected, and the corresponding 
eigenvalues $T_{i,\mathrm{new}}(k)$ are calculated under each of the $S$ known class models. 

Step 5: Fault identification. The class of the test data is determined by calculating 
the difference degree (8.4) between the test sample and the training data. 
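
The kernel projection in Step 2 can be sketched as follows for a Gaussian kernel. The function name `gaussian_kernel_vector`, the kernel width `sigma`, and the three training samples are assumptions made for this illustration.

```python
import numpy as np

def gaussian_kernel_vector(X_train, x, sigma=1.0):
    """xi = [K(x_1, x), ..., K(x_n, x)]^T with K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)   # squared distance to every training sample
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical training samples stored as rows.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
xi = gaussian_kernel_vector(X_train, np.array([0.0, 0.0]), sigma=1.0)
```

Each entry of `xi` decays towards zero as the sample moves away from the corresponding training point, so a test sample far from all training data yields a kernel vector close to zero.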


8.1.2 Simulation Experiment 


The proposed KEDA was used for fault identification in the penicillin fermentation 
process mentioned in Sect. 4.2. Nine process variables were considered for 
monitoring, and the three faults are listed in Table 8.1. The data were generated by the 
penicillin simulator with varying fault amplitudes and fault times. A total of 
40 batches were selected as the training data set: 10 batches each for the normal 
condition and the three known faults. The KEDA method with a Gaussian kernel function 
was used to find the optimal discriminant vectors for each type of model, so four 
different models were obtained. 

Experiment 1: Data classification Figures 8.1, 8.2, 8.3, and 8.4 compare the 
classification of KFDA and KEDA for the penicillin data: normal data and three 
types of fault data. When the test data differ from the four known types, their 
projections are also separated from the others. KFDA, however, shows weaker 
classification performance: some faults lie close together and the boundaries are not 
easily distinguishable, such as the fault 3 data (red x) and the test fault data (black W) in 
Figs. 8.1 and 8.3. KEDA classifies these data better, and the red and black parts are 
clearly separated in Figs. 8.2 and 8.4. These plots show that both the between-class 
and within-class distances increase for the different types of data in KEDA, but the 
between-class distance increases by a larger magnitude than the within-class distance. 
So the different types of data can be better separated. 


Table 8.1 Description of the fault type of penicillin process 

No.   Faults                          Types 
1     Bottom logistics decline        Step 
2     Decreased power of the mixer    Step 
3     Decreased airflow               Step 

Fig. 8.1 Two-dimensional classification visualization: KFDA method 

Experiment 2: Fault-type identification Consider the testing data set, which 
also consists of the four types of data and data from an unknown fault. Table 8.2 gives 
the eigenvalues of the testing data calculated with the KEDA model of 
fault 2. The eigenvalues are obtained by projecting the testing data onto the selected 
optimal discriminant directions. If there is a large difference between the testing data 
and the training data, then $\|u - v\|^2$ is large and the 
Gaussian kernel function, $K(u, v) = \exp(-\|u - v\|^2 / 2\sigma^2)$, is close to 
zero. However, the eigenvalues of non-matching data are sometimes not close to zero, as 
shown in Table 8.2. In this case, the eigenvalues of the test data need to be analyzed 
further. 

It is impractical to show the values at every sampling instant, so we further analyze 
the statistical characteristics of the eigenvalues projected onto the optimal discriminant 
direction of a known model. If the eigenvalues of the testing data follow a normal 
distribution in a model, the testing data belong to this model. Conversely, if the 
eigenvalues do not follow a normal distribution, the testing data do 
not match this model. Figures 8.5, 8.6, and 8.7 give the statistical analysis of 
the testing data (normal, faults 1 and 3) in the known fault 3 model. The eigenvalues 
of fault 3 follow a normal distribution in the fault 3 model, while those of the normal 
data and the fault 1 data do not. 

Fig. 8.2 Two-dimensional classification visualization: KEDA method 

Fig. 8.3 Three-dimensional classification visualization: KFDA method 

Fig. 8.4 Three-dimensional classification visualization: KEDA method 


Table 8.2 The eigenvalues of test data in fault 2 model 


Sampling    Eigenvalues of test data ($T_j$) 
instant     Normal    Fault 1    Fault 2    Fault 3    New fault 

53          0.148     -0.148     -0.203     0          0 
54          0.194     -0.194     0.0090     0          0 
55          0.448     0          0.1660     0          0 
56          0.187     0          0.1020     0          0 
79          0.079     0          -0.024     0          0 
80          0.103     0          -0.075     0          0 
81          0.108     0          -0.084     0          0 
82          0.041     0          -0.059     0          0 


Moreover, the difference degree between the test data and each known model is used to 
determine the type of fault. The results are shown in Table 8.3. Since some of the 
test data have zero eigenvalues in a known model, the denominator in 
definition (8.4) is zero; the difference degree then cannot be calculated and is shown as 
“-”. The difference degree is small if the test data belong to the known model 
and large if they do not. The test data are found to have the smallest difference 
degree in the matching model. 

Fig. 8.5 The eigenvalues of test normal data in fault 3 model 

Fig. 8.6 The eigenvalues of test fault 1 data in fault 3 model 

Table 8.3 The difference degree of test data in different models 

Type of test data    Fault 1 model    Fault 2 model    Fault 3 model    Normal model 
Normal               0.669503         1.448272         1.630094         0.516679 
Fault 1              0.223966         -                1.578313 
Fault 2              0.632128         0.550645         1.194915 
Fault 3              -                -                -                0.553784 
New fault            1.120218         -                -                1.137496 


8.2 Fault Identification Based on LLE and EDA 


A new dimensionality reduction approach based on the combination of EDA and 
LLE is proposed in two different forms: Local Linear Exponential Discriminant 
Analysis (LLEDA) and Neighborhood-Preserving Embedding Discriminant Analysis 
(NPEDA). This fusion idea combines global discriminant analysis with local structure 
preservation during the dimensionality reduction process. LLEDA and NPEDA are solved 
with different optimization objectives, and the corresponding maximum values are 
derived to reduce the computational complexity. Both exhibit good local preservation 
and global discrimination capabilities. The nonlinear analysis is transformed into an 
equivalent neighborhood-preserving problem based on the idea of piecewise linearization. 

The main difference between the two methods is that LLEDA is a parallel strategy 
whereas NPEDA is a cascading strategy. LLEDA balances global supervised 
discrimination against local nonlinear dimensionality reduction. It finds a 
projection vector that balances the local geometry and the data classification, 
and results in an optimal subspace projection of the samples. When faults are 
difficult to distinguish, the LLEDA method can improve the identification rate by 
adjusting the trade-off parameter between the global index and the local index. NPEDA 
is a cascading strategy in which the dimensionality reduction process is implemented in 
two successive steps: the first maintains the local geometric relationships 
by reconstructing each sample point as a linear weighted combination of its nearest 
neighbors, and the second performs discriminant analysis on the reconstructed 
samples. 


8.2.1 Local Linear Exponential Discriminant Analysis 


The basic idea of LLEDA is to project the samples into the optimal discriminant space 
while maintaining the local geometric structure of the original data. The schematic 
diagram is shown in Fig. 8.8. LLEDA combines the advantages of LLE and EDA, 
which extracts the global classification information while compressing the dimen- 
sionality of the feature space without destroying local relationships. It finds a bal- 
ance between global supervised discrimination and local preservation of nonlinearity 
through an adjusted trade-off parameter. 

Consider the original data being mapped into a hidden space $F$ via a function $A$. 
An explicit linear mapping from $X$ to $Y$, $Y = A^T X$, is constructed to circumvent the 
out-of-sample problem. The original LLE problem is written as follows: 

$$\min \varepsilon(Y) = \sum_{j=1}^{n} \Big\| y_j - \sum_{r=1}^{k} W_{jr} y_{jr} \Big\|^2 = \| Y(I - W) \|^2 = \mathrm{tr}(Y(I - W)(I - W)^T Y^T) = \mathrm{tr}(A^T X M X^T A), \tag{8.5}$$

where $M = (I - W)(I - W)^T$. 
The LLEDA problem is proposed with the following objective function: 

$$\max J(A) = \frac{\mathrm{tr}(A^T \exp(S_b) A)}{\mathrm{tr}(A^T \exp(S_w) A)} - \mu \cdot \mathrm{tr}(A^T X M X^T A), \tag{8.6}$$

where $\mu$ is a trade-off parameter that balances the intrinsic geometry and global 
discriminant information. In general, (8.6) is equivalently transformed into an 
optimization problem with constraint, 

$$\max J(A) = \mathrm{tr}(A^T \exp(S_b) A) - \mu \cdot \mathrm{tr}(A^T X M X^T A) \quad \text{s.t.} \quad A^T \exp(S_w) A = I, \tag{8.7}$$

where $A = [a_1, a_2, \ldots, a_n]$. (8.7) is solved by introducing the Lagrangian multiplier: 

$$L_1(a_i) = a_i^T (\exp(S_b) - \mu X M X^T) a_i + \theta (1 - a_i^T \exp(S_w) a_i), \tag{8.8}$$

where $\theta$ is the Lagrangian multiplier. Setting the gradient of $L_1(a_i)$ with 
respect to $a_i$ to zero gives 

$$(\exp(S_b) - \mu X M X^T) a_i = \theta \exp(S_w) a_i \quad \text{or} \quad (\exp(S_w))^{-1} (\exp(S_b) - \mu X M X^T) a_i = \theta a_i, \tag{8.9}$$

where $\theta$ is treated as a generalized eigenvalue. The discriminant matrix $A$ is made 
up of the eigenvectors corresponding to the $d$ largest eigenvalues in (8.9). 

Fig. 8.8 The schematic diagram of LLEDA 
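
A minimal sketch of solving (8.9) is given below, assuming that samples are stored as columns of `X` and that `S_b`, `S_w`, and `M` have already been formed. The function names, the value of `mu`, and the two-class toy data are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def sym_expm(S):
    """exp(S) for a symmetric matrix S via its eigenvalue decomposition."""
    lam, V = np.linalg.eigh(S)
    return (V * np.exp(lam)) @ V.T

def lleda_projection(S_b, S_w, X, M, mu=0.1, d=2):
    """Solve (exp(S_b) - mu X M X^T) a = theta exp(S_w) a; keep the d largest theta."""
    G = sym_expm(S_b) - mu * (X @ M @ X.T)
    G = 0.5 * (G + G.T)                      # remove rounding asymmetry
    L = np.linalg.cholesky(sym_expm(S_w))    # exp(S_w) is positive definite
    L_inv = np.linalg.inv(L)
    theta, U = np.linalg.eigh(L_inv @ G @ L_inv.T)
    order = np.argsort(theta)[::-1][:d]
    return theta[order], L_inv.T @ U[:, order]

# Toy two-class data (hypothetical): 4 variables, 10 samples as columns.
rng = np.random.default_rng(3)
m, n = 4, 10
X = 0.5 * rng.standard_normal((m, n))
labels = np.array([0] * 5 + [1] * 5)
X[:, labels == 1] += 0.5
mean_all = X.mean(axis=1, keepdims=True)
S_b, S_w = np.zeros((m, m)), np.zeros((m, m))
for c in (0, 1):
    Xc = X[:, labels == c]
    mc = Xc.mean(axis=1, keepdims=True)
    S_b += Xc.shape[1] * (mc - mean_all) @ (mc - mean_all).T
    S_w += (Xc - mc) @ (Xc - mc).T
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=0, keepdims=True)            # dense stand-in for the LLE weights
M = (np.eye(n) - W) @ (np.eye(n) - W).T
theta, A = lleda_projection(S_b, S_w, X, M, mu=0.1, d=2)
```

The returned `A` satisfies the constraint $A^T \exp(S_w) A = I$ of (8.7) by construction. Increasing `mu` gives more weight to local structure preservation, decreasing it gives more weight to global discrimination.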

8.2.2 Neighborhood-Preserving Embedding Discriminant 
Analysis 


NPEDA also finds a series of discriminant vectors and maps the samples into a 
new space. Each sample point is represented linearly by its neighbors to maintain the 
local geometry as much as possible during the projection process. The schematic 
diagram is shown in Fig. 8.9. NPEDA is a cascade strategy in which the dimensionality 
reduction process is divided into two successive steps: the first maintains the 
local geometric relationships by reconstructing each sample point as a linearly 
weighted combination of its neighbors, and the second performs discriminant analysis 
on the reconstructed samples. 

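Fig. 8.9 The schematic diagram of NPEDA 

Rewrite the between-class scatter matrix $S_b$ and the within-class scatter matrix 
$S_w$ under the explicit linear mapping $Y = A^T X$. The between-class scatter matrix 
becomes 

$$S_b = A^T X \left(B - \frac{1}{n} e e^T\right) X^T A,$$

where $e = [1, 1, \ldots, 1]^T$ with dimension $n$ and 

$$B_{ij} = \begin{cases} \dfrac{1}{n_k}, & x_i \text{ and } x_j \in k\text{-th class}, \\ 0, & \text{otherwise}. \end{cases}$$

The within-class scatter matrix becomes 

$$\begin{aligned} S_w &= \sum_{i=1}^{c} \sum_{j=1}^{n_i} (y_j^i - \bar{y}^i)(y_j^i - \bar{y}^i)^T = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (A^T x_j^i - A^T \bar{x}^i)(A^T x_j^i - A^T \bar{x}^i)^T \\ &= A^T \left( \sum_{i=1}^{c} \left( \sum_{j=1}^{n_i} (x_j^i - \bar{x}^i)(x_j^i - \bar{x}^i)^T \right) \right) A \\ &= A^T \left( \sum_{i=1}^{c} \left( X_i X_i^T - \frac{1}{n_i} X_i e_i e_i^T X_i^T \right) \right) A \\ &= A^T \sum_{i=1}^{c} (X_i L_i X_i^T) A, \end{aligned}$$

where $L_i = I - \frac{1}{n_i} e_i e_i^T$, $I$ is the unit matrix, and $e_i = [1, 1, \ldots, 1]^T$ with dimension $n_i$. 
The discriminant vectors $A^*$ are solved by the following optimization problem: 

$$A^* = \arg\max \frac{\left| A^T X (B - \frac{1}{n} e e^T) X^T A \right|}{\left| A^T \sum_{i=1}^{c} (X_i L_i X_i^T) A \right|}. \tag{8.12}$$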
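
Consider that each original sample is reconstructed by its neighbors with an error 
less than $\varepsilon$: 

$$\Big\| x_i - \sum_{r=1}^{k} W_{ir} x_{ir} \Big\|^2 < \varepsilon,$$

where $\varepsilon$ is a small positive number and $W$ is the reconstruction mapping matrix such that 
$\sum_{r=1}^{k} W_{ir} = 1$. Then 

$$\Big\| x_i - \sum_{r=1}^{k} W_{ir} x_{ir} \Big\|^2 = \Big\| \sum_{r=1}^{k} W_{ir} (x_i - x_{ir}) \Big\|^2 = \| Q_i W_i \|^2,$$

where $Q_i = [x_i - x_{i1}, x_i - x_{i2}, \ldots, x_i - x_{ik}]$. 
The matrix $W$ can be solved with a Lagrange multiplier: 

$$L_2 = \frac{1}{2} \| Q_i W_i \|^2 - \lambda_i \Big[ \sum_{r=1}^{k} W_{ir} - 1 \Big],$$

$$\frac{\partial L_2}{\partial W_i} = Q_i^T Q_i W_i - \lambda_i E = C_i W_i - \lambda_i E = 0,$$

where $W_i = \lambda_i C_i^{-1} E$, $C_i = Q_i^T Q_i$, and $E = [1, 1, \ldots, 1]^T$ with dimension $k$. 
Considering 

$$\sum_{r=1}^{k} W_{ir} = E^T W_i = 1 \;\Rightarrow\; E^T \lambda_i C_i^{-1} E = 1 \;\Rightarrow\; \lambda_i = (E^T C_i^{-1} E)^{-1},$$

we have 

$$W_i = \frac{C_i^{-1} E}{E^T C_i^{-1} E}.$$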
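
The closed-form weights that follow from the Lagrangian condition, $W_i = C_i^{-1} E / (E^T C_i^{-1} E)$, can be sketched as follows. The function name `reconstruction_weights`, the optional ridge term `reg` (useful when $C_i$ is close to singular, e.g., when $k$ exceeds the number of variables), and the toy sample are assumptions of this example.

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=0.0):
    """Weights W_i = C_i^{-1} E / (E^T C_i^{-1} E) with C_i = Q_i^T Q_i.

    x: (m,) sample; neighbors: (k, m) array holding its k nearest neighbors.
    reg: optional ridge added to C_i (an assumption, not part of the derivation).
    """
    Q = (x - neighbors).T                  # m x k, columns are x_i - x_ir
    C = Q.T @ Q
    C = C + reg * np.eye(C.shape[0])
    E = np.ones(C.shape[0])
    w = np.linalg.solve(C, E)              # proportional to C_i^{-1} E
    return w / w.sum()                     # divide by E^T C_i^{-1} E

# Hypothetical sample with three neighbors in a three-dimensional space.
x = np.array([0.2, 0.3, 0.1])
neighbors = np.eye(3)
w = reconstruction_weights(x, neighbors)
x_rec = neighbors.T @ w                    # reconstruction sum_r W_ir x_ir
```

Solving the linear system and normalizing afterwards avoids forming $C_i^{-1}$ explicitly while giving the same weights.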


The sample point is reconstructed by the optimal weights $W$, i.e., 
$x_i = \sum_{r=1}^{k} W_{ir} x_{ir}$. It is linearly represented by its neighbors, which maintains 
the local geometry in the dimensionality reduction process. Substituting it into (8.12), 
the NPEDA optimization is revised as follows: 

$$A^* = \arg\max \frac{\left| A^T \exp\left( (\sum_{r} W_{ir} x_{ir}) (B - \frac{1}{n} e e^T) (\sum_{r} W_{ir} x_{ir})^T \right) A \right|}{\left| A^T \exp\left( \sum_{i=1}^{c} \left( (\sum_{r} W_{ir} x_{ir}) L_i (\sum_{r} W_{ir} x_{ir})^T \right) \right) A \right|} = \arg\max \frac{\left| A^T \exp(S_{nb}) A \right|}{\left| A^T \exp(S_{nw}) A \right|}. \tag{8.13}$$

Solving (8.13) is equivalent to solving the generalized eigenvalue decomposition 
problem 

$$\exp(S_{nb}) A = \sigma \exp(S_{nw}) A \quad \text{or} \quad (\exp(S_{nw}))^{-1} \exp(S_{nb}) A = \sigma A, \tag{8.14}$$

where $\sigma$ is the generalized eigenvalue and the linear transformation matrix $A$ 
of NPEDA consists of the eigenvectors corresponding to the $d$ largest eigenvalues of 
$(\exp(S_{nw}))^{-1} \exp(S_{nb})$. 


8.2.3 Fault Identification Based on LLEDA and NPEDA 


In this section, the LLEDA and NPEDA methods are implemented for fault 
identification, with the monitoring flowchart shown in Fig. 8.10. The fault recognition 
rate (FCR) is introduced to test the identification effectiveness. The FCR of fault model 
$i$ is defined as the percentage of test data identified in this corresponding model out of 
the total number of samples tested: 

$$\mathrm{FCR}(i) = \frac{n_{i,\mathrm{identify}}}{n_{\mathrm{all}}} \times 100\%, \tag{8.15}$$

where $n_{i,\mathrm{identify}}$ denotes the number of samples identified as fault $i$ and $n_{\mathrm{all}}$ denotes the 
total number of samples of fault $i$. The identification process is given as follows: 


1. Process data are collected under the normal and faulty conditions, and 
standardized. 

2. The between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are 
calculated by the LLEDA (or NPEDA) method, respectively. 

3. The discriminant vector $A$ is obtained by maximizing the between-class dispersion 
matrix $S_b$ and minimizing the within-class dispersion matrix $S_w$. 

4. The discriminant function $g(x)$ of the online data $x$ is obtained by projection 
onto the discriminant vector $A$ in the normal model: 

$$g(x) = -\frac{1}{2} (x - \bar{x}^i)^T A \left( \frac{1}{n_i - 1} A^T \exp(S_i) A \right)^{-1} A^T (x - \bar{x}^i) + \ln(c_i) - \frac{1}{2} \ln\left[ \det\left( \frac{1}{n_i - 1} A^T \exp(S_i) A \right) \right]. \tag{8.16}$$

If the value of the discriminant function exceeds the normal limit, a fault 
occurs. 

5. The fault type of the online data is determined as the class with the maximum 
posterior probability. The posterior probability of data $x$ in fault class $c_i$ is calculated as 
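
$$P(x \in c_i \mid x) = \frac{P(x \mid x \in c_i) P(x \in c_i)}{\sum_{k=1}^{c} P(x \mid x \in c_k) P(x \in c_k)}, \tag{8.17}$$

where $P(x \in c_i)$ is the prior probability and $P(x \mid x \in c_i)$ is the conditional 
probability density function of the sample $x$: 

$$P(x \mid x \in c_i) = \frac{\exp\left[ -\frac{1}{2} (x - \bar{x}^i)^T A P_i A^T (x - \bar{x}^i) \right]}{(2\pi)^{d/2} \left[ \det\left( \frac{1}{n_i - 1} A^T \exp(S_i) A \right) \right]^{1/2}}, \tag{8.18}$$

where $P_i = \left[ \frac{1}{n_i - 1} A^T \exp(S_i) A \right]^{-1}$ and $d$ is the dimension of the projected space. 

Fig. 8.10 Flowchart of fault identification with LLEDA and NPEDA methods 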
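
The last identification step and the FCR of (8.15) can be sketched as follows, assuming Gaussian class densities in the projected space and Bayes' rule for the posterior. Here `covs[i]` stands for the projected class covariance $\frac{1}{n_i - 1} A^T \exp(S_i) A$; the function names and the toy numbers are assumptions of this example.

```python
import numpy as np

def posteriors(x, A, means, covs, priors):
    """Posterior probability of each class for sample x, from Gaussian densities."""
    log_joint = []
    for mean, cov, prior in zip(means, covs, priors):
        z = A.T @ (x - mean)                           # projected deviation
        _, logdet = np.linalg.slogdet(cov)
        quad = z @ np.linalg.solve(cov, z)
        log_density = -0.5 * (quad + logdet + len(z) * np.log(2.0 * np.pi))
        log_joint.append(np.log(prior) + log_density)
    log_joint = np.array(log_joint)
    joint = np.exp(log_joint - log_joint.max())        # shift for numerical stability
    return joint / joint.sum()

def fcr(true_labels, predicted_labels, fault):
    """Percentage of the samples of class `fault` that are identified as `fault`."""
    idx = [k for k, t in enumerate(true_labels) if t == fault]
    hits = sum(1 for k in idx if predicted_labels[k] == fault)
    return 100.0 * hits / len(idx)

# Toy example with two classes in a two-dimensional projected space.
A = np.eye(2)
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
p = posteriors(np.array([0.0, 0.0]), A, means, covs, priors=[0.5, 0.5])
predicted = int(np.argmax(p))
rate = fcr(true_labels=[1, 1, 1, 2], predicted_labels=[1, 2, 1, 2], fault=1)
```

Working with log densities and subtracting the maximum before exponentiating avoids underflow when a sample is far from every class mean.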


8.2.4 Simulation Experiment 


Multi-classification methods, FDA, EDA, LLE+FDA, LLEDA, and NPEDA, were 
carried out to evaluate the classification performance on the TE simulation platform. 
The TE operation lasted for 48 h, with faults introduced at the 8th hour and data sampled 
every 3 min. 400 training data were selected for building the classification model and 
400 testing data for evaluating the performance of the model. Three different types of 
faults were considered: faults 2, 8, and 13. Fault 2 refers to a step change in the B 
component feed with the A/C feed ratio remaining constant. Fault 8 refers to a random 
change in the A, B, and C feed component variables. Fault 13 refers to a slow drift in the 
reaction dynamics. Faults 8 and 13 are difficult to identify because of their random 
variation and slow drift, respectively. The training and testing data for the three types 
of faults were projected onto the first and second eigenvectors by the different methods, 
and the classification results are shown in Fig. 8.11. 

Table 8.4 shows the identification rates for faults 2, 8, and 13 under the different 
classification methods. Here the number of discriminant directions, i.e., the reduction 
order, ranges from 1 to 10. The identification rates generally improve 
as the number of discriminant vectors increases. The recognition rate for fault 
2 is high, close to 100%. The recognition rates for faults 8 and 13 gradually 
increase as the number of discriminant vectors increases. NPEDA and LLEDA 
show higher recognition rates on faults 2, 8, and 13 than the other methods, 
such as FDA and LLE+EDA. 

Figure 8.12 shows the posterior probability values for the different test data under 
the LLEDA and NPEDA methods. A larger posterior probability value means a 
higher possibility that the test data belong to the category. Furthermore, the diagnostic 
results are related to the classification capability: if the classification performance is 
good, a higher identification rate is achieved. 


8.3 Cluster-LLEDA-Based Hybrid Fault Monitoring 


8.3.1 Hybrid Monitoring Strategy 


Generally, the data collected from an actual industrial process are unlabeled and 
initially undiagnosed. The LLEDA method performs well in fault 
identification, but it is a supervised algorithm that requires a known classification 
of the historical data set. To overcome this problem, the supervised LLEDA method 
is extended into an unsupervised learning method by introducing cluster analysis. 
The cluster method provides the category information of the fault data, which is 
input to the LLEDA modeling module as prior knowledge. To make better use of the 
proposed cluster-LLEDA classification method, a hybrid fault monitoring strategy is 
given, as shown in Fig. 8.13. 

Figure 8.13 indicates that the hybrid fault monitoring strategy is mainly divided 
into three parts: historical data analysis, fault model library establishment, and 
online detection and fault identification. First, the historical data of the industrial 
process are roughly screened by PCA to label the fault data. Then the hierarchical 
clustering technique is used to classify the process data detected as faulty into 
different types. The model library is established for all fault types by LLEDA, which 
further extracts the fault features and obtains a fine identification. Finally, online 
detection and fault identification are realized. 

Fig. 8.11 Projection of different fault data on the first two feature vectors 

Table 8.4 Comparison of identification rate for faults 2, 8, and 13 

Reduction   Fault No.   FDA      EDA      LLE+EDA   LLEDA    NPEDA 
order 

1           Fault 2     1        1        1         1        1 
            Fault 8     0.4425   0.2125   0.4625    0.2125   0.2125 
            Fault 13    0.415    0.6875   0.4175    0.6875   0.6875 

2           Fault 2     1        1        1         1        1 
            Fault 8     0.3525   0.475    0.48      0.4175   0.475 
            Fault 13    0.36     0.6325   0.3475    0.6875   0.6325 

3           Fault 2     1        1        1         1        1 
            Fault 8     0.4375   0.67     0.3825    0.5975   0.67 
            Fault 13    0.29     0.55     0.3375    0.6275   0.55 

4           Fault 2     1        1        1         0.9925   1 
            Fault 8     0.47     0.8325   0.425     0.705    0.8325 
            Fault 13    0.2825   0.6575   0.295     0.565    0.6575 

5           Fault 2     1        1        0.995     1        1 
            Fault 8     0.625    0.8825   0.4875    0.815    0.8825 
            Fault 13    0.53     0.6375   0.3025    0.5975   0.6325 

6           Fault 2     1        1        1         1        1 
            Fault 8     0.664    0.9325   0.62      0.895    0.9325 
            Fault 13    0.5125   0.7225   0.25      0.6225   0.7225 

7           Fault 2     1        1        0.9925    1        1 
            Fault 8     0.695    0.8925   0.6       0.9125   0.8925 
            Fault 13    0.49     0.7425   0.2425    0.725    0.7425 

8           Fault 2     1        1        0.9825    1        1 
            Fault 8     0.7275   0.88     0.7075    0.885    0.88 
            Fault 13    0.4775   0.74     0.2275    0.7125   0.74 

9           Fault 2     1        1        0.99      1        1 
            Fault 8     0.745    0.88     0.6575    0.89     0.88 
            Fault 13    0.49     1        0.995     1        1 

10          Fault 2     0.99     1        0.995     1        1 
            Fault 8     0.7625   0.8725   0.5825    0.8825   0.8725 
            Fault 13    0.47     0.735    0.225     0.7125   0.735 

Fig. 8.12 Diagnosis results of faults 2, 8, and 13 by LLEDA and NPEDA methods 

Fig. 8.13 Hybrid fault detection and diagnosis information process 

The procedure of the historical data analysis part is summarized as follows: 

1. Collect and standardize the normal process data from the DCS historical database. 

2. Analyze the collected process data by PCA to extract the independent principal 
components, establish the PCA model of the normal operation, and calculate the 
statistics of the data. 

3. Calculate the statistics $T^2$ and SPE and their control limits. 
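
The PCA screening used in the historical data analysis can be sketched as follows. The function names, the number of retained components, and the synthetic data are assumptions of this example. For brevity the control limits are taken as empirical 99% quantiles of the training statistics, which is a simplification and not the distribution-based limits usually derived for $T^2$ and SPE.

```python
import numpy as np

def fit_pca(X_normal, n_pc):
    """PCA model of standardized normal data: mean, std, loadings P, eigenvalues lam."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std
    cov = Z.T @ Z / (len(Z) - 1)
    lam, V = np.linalg.eigh(cov)
    order = np.argsort(lam)[::-1][:n_pc]
    return mean, std, V[:, order], lam[order]

def t2_spe(x, mean, std, P, lam):
    """Hotelling T^2 and squared prediction error (SPE) of one sample."""
    z = (x - mean) / std
    t = P.T @ z                      # scores in the principal subspace
    t2 = np.sum(t ** 2 / lam)
    resid = z - P @ t                # part of z not explained by the model
    return t2, resid @ resid

# Synthetic normal operating data: 5 variables driven by 2 latent factors.
rng = np.random.default_rng(4)
latent = rng.standard_normal((200, 2))
load = rng.standard_normal((2, 5))
X_normal = latent @ load + 0.1 * rng.standard_normal((200, 5))

mean, std, P, lam = fit_pca(X_normal, n_pc=2)
train_stats = np.array([t2_spe(x, mean, std, P, lam) for x in X_normal])
t2_lim, spe_lim = np.quantile(train_stats, 0.99, axis=0)   # empirical limits

x_fault = X_normal[0] + 10.0 * std   # hypothetical sample with a large offset
t2_f, spe_f = t2_spe(x_fault, mean, std, P, lam)
is_fault = bool(t2_f > t2_lim or spe_f > spe_lim)
```

A sample is flagged as faulty when either statistic exceeds its limit; the flagged samples are the ones passed on to the clustering stage.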


The procedure of fault model library establishment is summarized as follows: 


1. Perform hierarchical clustering analysis on the abnormal operation data and divide 
them into different fault categories. 

2. Calculate the between-class and within-class scatter matrices $S_b$ and $S_w$, find the 
corresponding projection vector $A$ based on the LLEDA method, and establish the 
fault model library for all fault classes. 


The procedure of online detection and fault identification is summarized as follows: 

1. Sample the real-time data and standardize it. 

2. Perform the discriminant analysis based on the LLEDA method, project the sample 
data onto the projection direction, and extract the feature vector. 

3. Project the sample data onto the projection vector $A$ based on the normal model 
and judge whether the current operation is normal or abnormal by checking whether the 
discriminant function exceeds the limit. 

4. If a fault occurs, calculate the posterior probability in each fault model to identify 
the fault type. If the sample data do not belong to any existing fault category, the new 
fault is modeled and added to the fault model library. 


Clustering Analysis The hierarchical clustering algorithm is widely used: it is 
simple to compute, fast, and readily yields similarity results, without requiring the 
number of clusters in advance (Saxena et al. 2017). The clustering 
starts with each of the $n$ samples as its own class, and specifies the distance between 
samples and the distance between classes. Then the two closest classes are merged into 
a new class, and the distances between the new class and the other classes are 
calculated. The merging of the two closest classes is repeated, and the number of classes 
is reduced by one after each merge. The merging stops when all samples are 
merged into one class or a certain condition is met. 

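The class is denoted by $G$ in the cluster analysis. Suppose class $G$ has $m$ samples 
denoted by the column vectors $x_i$ ($i = 1, 2, \ldots, m$), $d_{ij}$ is the distance between $x_i$ 
and $x_j$, and $D_{KL}$ is the distance between two different categories $G_K$ and $G_L$. The 
squared distance $D_{KL}^2$ between $G_K$ and $G_L$ is defined as follows: 

$$D_{KL}^2 = \frac{1}{n_K n_L} \sum_{x_i \in G_K,\, x_j \in G_L} d_{ij}^2. \tag{8.19}$$

The recursive formula for the between-class squared distance is 

$$D_{MJ}^2 = \frac{n_K}{n_M} D_{KJ}^2 + \frac{n_L}{n_M} D_{LJ}^2, \tag{8.20}$$

where $G_M$ is the class obtained by merging $G_K$ and $G_L$, $n_M = n_K + n_L$, and $G_J$ is any other class. 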
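
Assuming that the recursion takes the usual class-average form, $D_{MJ}^2 = (n_K D_{KJ}^2 + n_L D_{LJ}^2)/n_M$ for the merged class $G_M = G_K \cup G_L$, it can be checked against the direct definition (8.19). In this sketch samples are stored as rows, and the three random classes and all names are hypothetical.

```python
import numpy as np

def avg_sq_dist(G_a, G_b):
    """Mean squared Euclidean distance over all cross pairs of two classes, as in (8.19)."""
    diff = G_a[:, None, :] - G_b[None, :, :]
    return np.mean(np.sum(diff ** 2, axis=2))

rng = np.random.default_rng(5)
G_K = rng.standard_normal((3, 2))
G_L = rng.standard_normal((4, 2))
G_J = rng.standard_normal((5, 2))
G_M = np.vstack([G_K, G_L])                # class obtained by merging G_K and G_L
n_K, n_L, n_M = len(G_K), len(G_L), len(G_M)

direct = avg_sq_dist(G_M, G_J)             # definition applied to the merged class
recursive = (n_K / n_M) * avg_sq_dist(G_K, G_J) + (n_L / n_M) * avg_sq_dist(G_L, G_J)
```

The recursion means that after each merge only the distances to the new class need updating, without returning to the raw samples.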


The inconsistency coefficient $Y$ is used to determine the final number of clusters 
$c$. Here $Y$ is an $(n - 1) \times 4$ matrix, where the first column is the mean of all link 
lengths (i.e., merging class distances) involved, the second column is the standard 
deviation of all the related link lengths, the third column is the number of related 
links, and the fourth column is the inconsistency coefficient. 

For the links obtained by the $k$th merging class, the inconsistency coefficient is 
calculated as follows: 

$$Y(k, 4) = \frac{Z(k, 3) - Y(k, 1)}{Y(k, 2)}, \tag{8.21}$$

where the input $Z_{(n-1) \times 3}$ is the matrix of the hierarchical clustering tree. While 
keeping the number of classes as small as possible, the change of the 
inconsistency coefficient determines the final number of classes. 
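
The clustering and the inconsistency matrix can be sketched with SciPy's hierarchical clustering routines. Note two differences from the notation above: SciPy's linkage matrix `Z` has four columns (the merge distance is in column index 2, which corresponds to $Z(k, 3)$ in one-based indexing), and all indices are zero-based. The synthetic three-cluster data and the use of a fixed cluster count in `fcluster` are assumptions of this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent, fcluster
from scipy.spatial.distance import pdist

# Synthetic data: three well-separated groups of 10 samples each (rows).
rng = np.random.default_rng(6)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + 0.2 * rng.standard_normal((10, 2)) for c in centers])

# Average linkage on squared Euclidean distances, matching (8.19).
Z = linkage(pdist(X, metric="sqeuclidean"), method="average")

# (n-1) x 4 matrix: mean, standard deviation, link count, inconsistency coefficient.
Y = inconsistent(Z, d=2)

# For illustration the tree is simply cut into three clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
```

Large jumps in the last column of `Y` mark merges that join dissimilar classes, which is the information used to choose the final number of classes.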

8.3.2 Simulation Study 


The experiment uses the Tennessee Eastman (TE) process to evaluate the 
effectiveness of the proposed hybrid method. 

Experiment 1: Failure Initial Screening and Classification The TE data set 
was first screened by the PCA method, and the fault detection results are shown in 
Fig. 8.14. The final $T^2$ and SPE statistics obtained were 0.4951 and 0.6882, 
respectively. The detection rate for each fault is given in Table 8.5. The results show 
that the recognition rates of faults 1, 2, 6, 7, 8, 12, 13, 14, 17, and 18 are high, and the 
recognition rates of the other faults are low. This indicates that the significant faults 
can be detected, while the potential faults cannot be detected. 

Therefore, PCA-based fault detection methods can only coarsely split the data 
set and detect significant faults. Potential faults can be identified with a high fault 
identification rate only in the case of known fault categories. In the coarse separation 
stage of historical data, the fault data can be identified not only by PCA method, 
but also by improved PCA or other fault detection methods to further improve the 
identification rate. 

After the historical data analysis, the fault data set is collected and clustered into 
different fault classes by the hierarchical clustering method. According to the 
inconsistency coefficient, the final number of fault classes is 10. Because of the large 
number of fault types, it is difficult to display all the classified fault data together in 
one tree diagram. As an example, we select faults 1, 2, and 6 to demonstrate the 
clustering effect of the hierarchical cluster analysis algorithm. Fault 1 is a step change 
in the A/C feed ratio with component B remaining unchanged, while fault 2 is a step 
change in component B with the A/C ratio remaining unchanged. Fault 6 is a step change 
in the feed loss of A. The hierarchical clustering tree diagram is given in Fig. 8.15. The 
final number of categories is three according to the inconsistency coefficient, which 
is consistent with the actual classification. 

Fig. 8.14 Fault detection based on PCA 

Table 8.5 Fault recognition rate based on PCA 

Fault No.   $T^2$    SPE      Fault No.   $T^2$    SPE 

Fault 1     0.995    0.9988   Fault 12    0.9875   0.99 
Fault 2     0.9825   0.9925   Fault 13    0.9513   0.9625 
Fault 3     0.0225   0.2675   Fault 14    0.9988   1 
Fault 4     0.41     1        Fault 15    0.0488   0.2625 
Fault 5     0.2625   0.5025   Fault 16    0.2325   0.6937 
Fault 6     0.99     1        Fault 17    0.8013   0.975 
Fault 7     1        1        Fault 18    0.8912   0.9375 
Fault 8     0.975    0.9825   Fault 19    0.0675   0.5913 
Fault 9     0.0362   0.235    Fault 20    0.3738   0.735 
Fault 10    0.4163   0.7638   Fault 21    0.3775   0.6687 
Fault 11    0.5212   0.8163 

Fig. 8.15 Hierarchical cluster analysis (dendrogram of distance versus samples) 

Now the fault data have been divided into 10 classes by hierarchical cluster 
analysis. The data are high-dimensional, so direct visualization is poor. In order to 
improve the visualization and to reflect the trend of, and the interrelationship 
between, the variables at the same time, the parallel coordinate visualization method 
is selected. It is a visualization technique that represents high-dimensional variables 
by a series of axes parallel to each other. The value of each variable 
corresponds to a position on its axis. 

The visualization results for each type of fault data are shown in Fig. 8.16. The 
blue dashes in each subplot indicate the normal data, and the dashes in other colors 
indicate the different fault data. Since each variable in the TE data has a corresponding 
physical meaning, the type of fault can be judged by comparing the other colored dashes 
with the blue dashes for each variable. These faults can then be labeled for establishing 
the fault model library. 

Fig. 8.16 Parallel coordinate visualization of fault data 

Experiment 2: LLEDA-based Fault Identification The fault identification 
method used here is LLEDA, which increases the distance between different classes 
and improves the classification ability even if the fault samples are few. Here faults 4, 
8, and 13 are selected as examples to show the identification results. Fault 4 is a minor 
fault, which appears as a step change in the inlet temperature of the reactor 
cooling water, while the other 50 variables remain in a stable state and change 
by less than 2% compared with the normal data. Fault 13 refers to a slow drift of 
the reactor kinetic constants, which causes strong responses 
in every variable and keeps the final product G fluctuating. Fault 8 refers 
to random changes in the A, B, and C feed ingredients. 

To better observe the classification in spatial structure, the training data and testing 
data of the three faults are projected onto the first three feature vectors by different 
methods. The classification results are shown in Fig. 8.17. 

Fig. 8.17 Projection of different fault data on feature vectors 

Fig. 8.18 Diagnosis results of fault 4, 8, and 13 by LLEDA methods 

Figure 8.18 shows the posterior probability values of the different test data obtained 
by the LLEDA method under different models. The posterior probability values are larger 
when the samples belong to category $i$. The colored bars indicate the diagnostic result, 
i.e., the probability values, with the color bar from bottom to top corresponding to 
probability values from 0 to 1 (white indicates an identification probability of 0 and 
red indicates an identification probability of 1). In this way, the fault 
identification results are visualized. The diagnosis result is related to the classification 
ability: better classification performance leads to a higher fault recognition rate. 
Here fault 13 is poorly classified owing to the small number of feature vectors. 
The recognition rate of faults can be improved by increasing the number of feature 
vectors. 


8.4 Conclusion 


This chapter presents three discriminant analysis methods, KEDA, LLEDA, and 
NPEDA, that can handle nonlinearities and avoid the small-sample problem. Normal 
and faulty data models are developed and used to check whether abnormal behavior 
occurs, and variance-based performance metrics are used to identify the type of the 
tested data. In particular, two new supervised dimensionality reduction methods, 
LLEDA and NPEDA, are proposed. They combine the advantages of local linear 
embedding and exponential discriminant analysis, taking into account both global 
and local information. The nonlinear data are piecewise linearized by maintaining the 
internal structure during the extraction of the eigenvalues. Both methods overcome 
the singularity problem of the within-class scatter matrices and therefore perform 
well on small-sample problems. 

Furthermore, a hybrid process monitoring and fault identification algorithm is 
proposed in this chapter, which effectively combines PCA-based initial detection, 
classification by hierarchical clustering, and discriminant analysis by LLEDA. 
This hybrid method ensures that monitoring and diagnosis are performed directly on 
the collected data without a priori knowledge. 


Chapter 9 
Global Plus Local Projection to Latent Structures 


Owing to rising demands on process operation and product quality, modern 
industrial processes have become more complicated and produce a large number of 
process and quality variables. Quality-related fault detection and diagnosis are 
therefore essential for complex industrial processes. Data-driven statistical process 
monitoring plays an important role here by extracting the useful information from 
these highly correlated process and quality variables, because the quality variables 
are measured at a much lower frequency and usually with a significant time delay 
(Ding 2014; Aumi et al. 2013; Peng et al. 2015; Zhang et al. 2016; Yin et al. 2014). 
Monitoring the process variables related to the quality variables is important for 
finding potential hazards that may lead to a system shutdown and possibly enormous 
economic loss. 

PLS is a typical multivariate statistical analysis technique that relates two coordinate 
spaces, which makes it well suited to quality-related fault detection and process 
monitoring. However, actual industrial data are often strongly nonlinear, dynamic, 
and coupled. The PLS method considers only the static linear mapping between 
multiple sources of data, so it is difficult to achieve accurate detection results by 
applying PLS directly. An important direction is therefore how to introduce a local 
structure-preserving capability into the global structure projection of PLS in order 
to extract the complex features of industrial data. This fusion of global and local 
structure can usually be implemented by two strategies, plus and embedding. This 
chapter focuses on the plus strategy. The global and local partial least squares 
(GLPLS) method is introduced first. The global plus local projection to latent 
structure (GPLPLS) method is then proposed, and three different performance 
functions are given according to the projection requirements of the input 
measurement space and the output measurement space, separately or simultaneously. 
The next two chapters focus on the embedding strategy: two different embedding 
methods, locality-preserving partial least squares (LPPLS) and local linear embedded 
projection of latent structure (LLEPLS), are proposed, which use LPP and LLE as the 
local structure-preserving technique, respectively. 


9.1 Fusion Motivation of Global Structure and Local 
Structure 


Currently, partial least squares (PLS), which is one of those data-driven methods 
(Severson et al. 2016; Ge et al. 2012; Li et al. 2010; Zhao 2014; Zhang and Qin 
2008), is widely used because of its advantages in extracting the latent variables by 
establishing the relationship between input and output space for quality-relevant pro- 
cess monitoring (Qin 2010). It maintains the maximum correlation between quality 
and process variables and has better quality-related fault detection capability. How- 
ever, the nature of PLS is a linear projection, which is not applicable for nonlinear 
systems. It uses only global structural information with information such as mean and 
variance and performs poorly in systems with strong local nonlinear characteristics. 

Nonlinear PLS methods can be divided into two categories: external nonlinear 
PLS models and internal nonlinear PLS models, as shown in Fig. 9.1. 

External nonlinear PLS models introduce nonlinear transformations of the input 
and/or output variables. An example is kernel partial least squares (KPLS) (Rosipal 
and Trejo 2001; Godoy et al. 2014), which describes the nonlinear relationship 
between the independent variables and extends the linear relationship between 
the inputs and outputs. KPLS effectively solves the nonlinear problem between the 
principal components of the input space and the output space, but the selection of 
the kernel function is difficult in practical applications. Similarly, the kernel 
concurrent canonical correlation analysis (KCCCA) algorithm was proposed for 
quality-relevant nonlinear process monitoring; it considers the nonlinearity in the 
quality variables (Zhu et al. 2017). Kernel-based methods map the original data into 
a (possibly high-dimensional) Hilbert space (feature space), but the projection in the 
feature space is complex, the direction and length of the projection cannot be 
determined, and the choice of kernel function is not straightforward. 

Fig. 9.1 Outer and inner model presentation for linear PLS decomposition 

In inner nonlinear PLS models, the internal linear model between the latent variables 
is replaced by a nonlinear model while the external model remains unchanged. 
Examples are the quadratic partial least squares (QPLS) (Wold et al. 1989), spline 
function PLS (SPLS) (Wold 1992), and neural network PLS (NNPLS) (Qin and McAvoy 
1992, 1996) approaches. Recursive nonlinear PLS (RNPLS) models are built by 
extending the input and output matrices on top of PLS (Li et al. 2005). Nonlinear 
PLS based on the slice transformation (NPLSSLT) can be used for nonlinear 
correction, where SLT-based piecewise linear mapping functions are used to construct 
nonlinear relationships between the input and output score vectors (Shan et al. 
2015). The nonlinear iterative partial least squares (NIPALS) algorithm is improved 
by assuming that the score vector is a linear projection of the original variables in 
the internal nonlinear PLS, at the cost of increased computational and optimization 
complexity. 

Some PLS methods have nonlinearities in both the outer model and the inner model. 
An example is the orthogonal nonlinear PLS (O-NLPLS) method, which considers 
orthogonal correlated nonlinearities between the input and output variables (Doymaz 
et al. 2003). This method retains the orthogonality properties of the PCA method 
because it is based on a neural network architecture. Similarly, an RBF network is 
used to identify the nonlinearity of the input variables and to establish the nonlinear 
relationship between the input and output variables (Zhao et al. 2006; Shimizu et al. 
2006). 

The different linear PLS representations are mathematically equivalent. However, 
different nonlinear PLS methods differ in performance and characteristics. Existing 
nonlinear PLS methods have some shortcomings, such as the problem of choosing 
kernel functions or latent structures for unknown nonlinear systems, the increasing 
computational complexity when neural networks are used for nonlinear mapping, 
and the lack of a superior PLS decomposition algorithm. Simplifying the nonlinear 
PLS modeling problem is therefore an urgent issue. 

Because PLS and its extended algorithms focus only on the global structural 
information and cannot extract the local neighborhood structure of the data well, 
they are not suitable for extracting nonlinear features. The local linearization 
approach to nonlinear problems is therefore considered. In recent years, 
locality-preserving projections (LPP) (He and Niyogi 2003; He et al. 2005), a manifold 
learning method, have been proposed to capture the local neighborhood structure and 
effectively make up for this deficiency. There are many other manifold learning 
methods, such as isometric feature mapping (Tenenbaum et al. 2000), locally linear 
embedding (LLE) (Roweis and Saul 2000), and Laplacian eigenmaps (Belkin and 
Niyogi 2003). 

Manifold learning methods preserve the local features by projecting the global 
structure to an approximately linear space and by constructing a neighborhood graph 
to explore the inherent geometric features and manifold structure of the sample 
data sets. However, these methods do not consider the overall structure and lack a 
detailed analysis and explanation of the correlation between process and quality 
variables. Combining global projection methods, such as PLS, with manifold 
learning methods, such as LPP and LLE, has therefore become a topic of concern for 
a growing number of engineers. 

Regarding the combination of global and local information, Zhong et al. proposed 
a quality-related global and local partial least squares (GLPLS) model (Zhong et al. 
2016). The GLPLS method integrates the advantages of the LPP and PLS methods 
and extracts meaningful low-dimensional representations from the high-dimensional 
process and quality data. The principal components in GLPLS preserve the local 
structural information of their respective data sets as much as possible. However, 
the correlation between the process and quality variables is not enhanced, and the 
constraints of LPP are removed from the optimization objective function. As a 
result, the monitoring results are seriously affected. 

After further analysis of the geometric characteristics of LPP and PLS, Wang et al. 
proposed a new integration method, the locality-preserving partial least squares 
(LPPLS) model, which pays more attention to the locality-preserving characteristics 
(Wang et al. 2017). LPPLS can exploit the underlying geometrical structure, which 
contains the local characteristics, in the input and output spaces. Although the 
maximization of the correlation between the process and quality variables was 
considered, the global characteristics were converted into a combination of multiple 
local linearized characteristics and were not expressed directly. In many processes, 
the linear relationship may be the most important, and it is best described directly 
rather than through a combination of multiple local linearized characteristics. 


9.2 Mathematical Description of Dimensionality Reduction 


9.2.1 PLS Optimization Objective 


The PLS algorithm is used to model the relationship between the normalized 
data sets $X = [x(1), x(2), \ldots, x(n)]^{T} \in \mathbb{R}^{n \times m}$ 
($x = [x_1, x_2, \ldots, x_m]^{T}$) and 
$Y = [y(1), y(2), \ldots, y(n)]^{T} \in \mathbb{R}^{n \times l}$ 
($y = [y_1, y_2, \ldots, y_l]^{T}$). $X$ contains the process variables and 
$Y$ the quality variables. $m$ and $l$ are the dimensions of the input and output 
spaces, and $n$ is the number of samples. $X$ and $Y$ are decomposed as follows: 

$$X = TP^{T} + \bar{X} \tag{9.1}$$ 

$$Y = UQ^{T} + \bar{Y}, \tag{9.2}$$ 

where $T = [t_1, t_2, \ldots, t_d] \in \mathbb{R}^{n \times d}$ and 
$U = [u_1, u_2, \ldots, u_d] \in \mathbb{R}^{n \times d}$ are the score matrices of 
$X$ and $Y$, respectively. $P = [p_1, p_2, \ldots, p_d] \in \mathbb{R}^{m \times d}$ 
and $Q = [q_1, q_2, \ldots, q_d] \in \mathbb{R}^{l \times d}$ are the load matrices of 
$X$ and $Y$. $\bar{X} \in \mathbb{R}^{n \times m}$ and 
$\bar{Y} \in \mathbb{R}^{n \times l}$ are the residual matrices of $X$ and $Y$. 
$d$ is the number of latent variables. The weight vectors $w$ and $c$ are derived by 
the NIPALS algorithm such that the covariance of the score vectors $t$ and $u$ is 
maximized. 


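$$\max \operatorname{cov}(t, u) = \sqrt{\operatorname{Var}(t)\operatorname{Var}(u)}\; r(t, u) 
= \sqrt{\operatorname{Var}(Xw)\operatorname{Var}(Yc)}\; r(Xw, Yc). \tag{9.3}$$ 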


Equation (9.3) is actually equivalent to solving the following optimization problem: 


$$\max_{w,c}\ \langle Xw, Yc \rangle \quad \text{s.t. } \|w\| = 1,\ \|c\| = 1 \tag{9.4}$$ 

or 

$$J_{PLS} = \max\ w^{T}X^{T}Yc \quad \text{s.t. } \|w\| = 1,\ \|c\| = 1. \tag{9.5}$$ 
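
The first pair of weight vectors in (9.5) can be illustrated numerically. The sketch below does not use NIPALS; it swaps in a singular value decomposition of $X^{T}Y$, whose leading singular vectors maximize $w^{T}X^{T}Yc$ under the unit-norm constraints. The function name `pls_first_weights` and the random demo data are hypothetical and serve only as an illustration.

```python
import numpy as np

def pls_first_weights(X, Y):
    """Unit vectors (w, c) that maximize w^T X^T Y c, plus the maximum value."""
    # Under ||w|| = ||c|| = 1 the maximizers are the leading left and right
    # singular vectors of the cross-product matrix X^T Y (SVD instead of NIPALS).
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, 0], Vt[0], s[0]

# Hypothetical demo data: 50 samples, 4 process variables, 2 quality variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Y = X @ rng.standard_normal((4, 2)) + 0.1 * rng.standard_normal((50, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalized data, as the text assumes
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

w, c, j_pls = pls_first_weights(X, Y)
t, u = X @ w, Y @ c                        # first pair of score vectors
```

Here `j_pls` is the attained value of the objective in (9.5), and `t` and `u` are the corresponding first score vectors.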


9.2.2 LPP and PCA Optimization Objectives 


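LPP aims to project the points in space $X$ into a low-dimensional space 
$\Phi = [\phi(1), \phi(2), \ldots, \phi(n)]^{T} \in \mathbb{R}^{n \times d}$ 
($d < m$, $\phi = [\phi_1, \ldots, \phi_d]^{T}$) via the projection 
matrix $W = [w_1, \ldots, w_d] \in \mathbb{R}^{m \times d}$, that is, 

$$\phi^{T}(i) = x^{T}(i)W, \quad (i = 1, 2, \ldots, n). \tag{9.6}$$ 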


The optimal mapping of the input space can be obtained by solving the following 
minimization problem: 


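$$\begin{aligned} 
J_{LPP}(w) &= \min \frac{1}{2}\sum_{i,j=1}^{n} \|\phi_i - \phi_j\|^{2} s_{xij} \\ 
&= \min \left(w^{T}X^{T}D_xXw - w^{T}X^{T}S_xXw\right) \\ 
&\quad \text{s.t. } w^{T}X^{T}D_xXw = 1, 
\end{aligned} \tag{9.7}$$ 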


where $S_x = [s_{xij}] \in \mathbb{R}^{n \times n}$ is the neighboring relationship 
matrix between $x(i)$ and $x(j)$, $D_x = [d_{xii}]$ is a diagonal matrix with 
$d_{xii} = \sum_{j} s_{xij}$, and 

$$s_{xij} = \begin{cases} 
e^{-\frac{\|x(i) - x(j)\|^{2}}{2\delta_x^{2}}}, & x(i) \text{ and } x(j) \text{ are “neighbors”} \\ 
0, & \text{otherwise.} 
\end{cases} \tag{9.8}$$ 

$\delta_x$ is the neighborhood parameter. The “neighbors” of $x(i)$ and $x(j)$ are 
determined by the K-nearest neighbors method. 

The LPP problem (9.7) in space $X$ is rewritten as follows: 


$$J_{LPP}(w) = \max\ w^{T}X^{T}S_xXw \quad \text{s.t. } w^{T}X^{T}D_xXw = 1. \tag{9.9}$$ 

The local structure information of $X$ is contained in the matrices $X^{T}S_xX$ and 
$X^{T}D_xX$. The magnitude of each diagonal element indicates how strongly the 
corresponding variable contributes to preserving the local structure. The 
off-diagonal elements correspond to the correlation between the observed variables. 
Similarly, the optimization problem for PCA can be expressed as follows: 

$$J_{PCA}(w) = \max\ w^{T}X^{T}Xw \quad \text{s.t. } w^{T}w = 1. \tag{9.10}$$ 
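
The parallel between (9.9) and (9.10) is easiest to see when both are solved as eigenvalue problems. The following is a minimal sketch under stated assumptions: the kernel width convention `2 * delta**2`, the symmetrization rule (two points are neighbors if either is among the other's K nearest), and all function names are illustrative choices, not taken from the text.

```python
import numpy as np

def heat_kernel_knn(X, k=5, delta=1.0):
    """Symmetric neighbor matrix S_x and diagonal degree matrix D_x, cf. (9.8)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # skip the point itself
        S[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * delta ** 2))
    S = np.maximum(S, S.T)     # assumption: neighbors if either one selects the other
    return S, np.diag(S.sum(axis=1))

def lpp_direction(X, S, D):
    """Maximize w^T X^T S X w subject to w^T X^T D X w = 1, cf. (9.9)."""
    A, B = X.T @ S @ X, X.T @ D @ X
    L = np.linalg.cholesky(B)                  # B = L L^T, assumes B positive definite
    Linv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(Linv @ A @ Linv.T)
    return Linv.T @ vecs[:, -1], vals[-1]      # back-transform the top eigenvector

def pca_direction(X):
    """Maximize w^T X^T X w subject to w^T w = 1, cf. (9.10)."""
    vals, vecs = np.linalg.eigh(X.T @ X)
    return vecs[:, -1], vals[-1]

# Hypothetical demo data: 60 centered samples with 4 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4))
X = X - X.mean(axis=0)
S, D = heat_kernel_knn(X, k=5, delta=1.0)
w_lpp, j_lpp = lpp_direction(X, S, D)
w_pca, j_pca = pca_direction(X)
```

Both problems share the form "maximize a quadratic form under a quadratic normalization"; only the weighting matrices differ, which is the similarity the fusion idea builds on.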


Given the similarity between the optimization goals of LPP and PCA, and the 
component extraction idea of PCA that is included in PLS, it is natural to fuse the 
LPP features into PLS to mitigate the limitation of PLS, namely its lack of local 
feature extraction capability. The simplest feature fusion method is to combine the 
two optimization goals into a new optimization goal through trade-off parameters, 
as in GLPLS (Zhong et al. 2016). 


9.3 Introduction to the GLPLS 


The GLPLS method is presented here to obtain the relationship between the quality 
and measurement variables while maintaining the local characteristics as much as 
possible. The main idea is to integrate the LPP method, which preserves the local 
structural characteristics, with the PLS method, which performs the quality-relevant 
statistical analysis. As a result, the GLPLS method is able not only to identify the 
latent characteristic directions for both the measurement and the quality data space 
but also to preserve (to the greatest extent possible) the local structural 
characteristics in the two hidden subspaces. 

Both the manifold structure of the process variables $X$ and that of the product 
output variables $Y$ are considered by introducing parameters $\lambda_1$ and 
$\lambda_2$ to control the trade-off between the extraction of the global and local 
features. The objective of the GLPLS-based method is therefore defined as 

$$J_{GLPLS}(w, c) = \arg\max\{w^{T}X^{T}Yc + \lambda_1 w^{T}\theta_x w + \lambda_2 c^{T}\theta_y c\} 
\quad \text{s.t. } w^{T}w = 1,\ c^{T}c = 1, \tag{9.11}$$ 

where $\theta_x = X^{T}S_xX$ and $\theta_y = Y^{T}S_yY$ represent the local structure 
information of the process variables and quality variables, respectively. $S_x$, 
$S_y$, $D_x$, and $D_y$ are the local feature parameters of the LPP algorithm. 
Parameters $\lambda_1$ and $\lambda_2$ weight the global and local features. 

It can be seen from (9.11) that the objective function of GLPLS contains the 
objective function of the PLS algorithm, $w^{T}X^{T}Yc$, and a part of the 
optimization problem of the LPP algorithm, $w^{T}X^{T}S_xXw$ and $c^{T}Y^{T}S_yYc$. 

The optimization function (9.11) seems to be a good combination of the global 
characteristics of the PLS algorithm and the locality-preserving characteristics of 
the LPP algorithm. Is that really the case? Let us analyze the solution of the 
optimization problem first. To solve the optimization objective function (9.11), the 
following Lagrange function is introduced: 

$$\Psi(w, c) = w^{T}X^{T}Yc + \lambda_1 w^{T}\theta_x w + \lambda_2 c^{T}\theta_y c 
- \eta_1(w^{T}w - 1) - \eta_2(c^{T}c - 1). \tag{9.12}$$ 

Then, according to the conditions for an extremum, (9.11) is resolved as follows 
(Zhong et al. 2016): 

$$J_{GLPLS}(w, c) = \eta_1 + \eta_2. \tag{9.13}$$ 

Let $\lambda_1 = \eta_1$ and $\lambda_2 = \eta_2$. The best projection vector $w$ is 
the eigenvector corresponding to the largest eigenvalue of 
$(I - \theta_x)^{-1}X^{T}Y(I - \theta_y)^{-1}Y^{T}X$, and the best projection vector 
$c$ is the eigenvector corresponding to the largest eigenvalue of 
$(I - \theta_y)^{-1}Y^{T}X(I - \theta_x)^{-1}X^{T}Y$, that is, 

$$\begin{aligned} 
(I - \theta_x)^{-1}X^{T}Y(I - \theta_y)^{-1}Y^{T}Xw &= 4\eta_1\eta_2 w \\ 
(I - \theta_y)^{-1}Y^{T}X(I - \theta_x)^{-1}X^{T}Yc &= 4\eta_1\eta_2 c. 
\end{aligned} \tag{9.14}$$ 

Equation (9.13) shows that the optimal value of GLPLS is $\eta_1 + \eta_2$, but in 
the actual calculation (9.14), the quantity maximized by the GLPLS algorithm 
is $\eta_1\eta_2$. In most cases, the conditions for maximizing $\eta_1 + \eta_2$ 
and $\eta_1\eta_2$ are different. 
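
The step from (9.12) to (9.13) and (9.14) can be spelled out. The following is a sketch of the stationarity conditions, assuming that $\theta_x$ and $\theta_y$ are symmetric and that $I - \theta_x$ and $I - \theta_y$ are invertible; these assumptions are not stated explicitly in the text.

```latex
% Stationarity of (9.12):
\begin{aligned}
\frac{\partial \Psi}{\partial w} &= X^{T}Yc + 2\lambda_1\theta_x w - 2\eta_1 w = 0, \\
\frac{\partial \Psi}{\partial c} &= Y^{T}Xw + 2\lambda_2\theta_y c - 2\eta_2 c = 0 .
\end{aligned}

% Left-multiply by w^T and c^T and use w^T w = c^T c = 1:
\begin{aligned}
w^{T}X^{T}Yc + 2\lambda_1 w^{T}\theta_x w &= 2\eta_1, \\
c^{T}Y^{T}Xw + 2\lambda_2 c^{T}\theta_y c &= 2\eta_2 .
\end{aligned}

% Adding both lines and dividing by two gives (9.13):
w^{T}X^{T}Yc + \lambda_1 w^{T}\theta_x w + \lambda_2 c^{T}\theta_y c = \eta_1 + \eta_2 .

% With \lambda_1 = \eta_1 and \lambda_2 = \eta_2 the stationarity conditions become
X^{T}Yc = 2\eta_1 (I - \theta_x) w, \qquad
Y^{T}Xw = 2\eta_2 (I - \theta_y) c .

% Eliminating c from the first equation by means of the second gives the first line of (9.14):
(I - \theta_x)^{-1} X^{T}Y (I - \theta_y)^{-1} Y^{T}X\, w = 4\eta_1\eta_2\, w .
```

The eigenvalue in the last line is the product $4\eta_1\eta_2$, whereas the objective value is the sum $\eta_1 + \eta_2$, which is the mismatch discussed above.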

To explain the reason for this result, we return to the GLPLS optimization 
objective (9.11). Equation (9.11) is a combined optimization problem of global (PLS) 
and local (LPP) features. This combination is undeniably reasonable to a certain 
extent. However, the latent variables of PLS are chosen such that they carry as much 
variation as possible and the correlation between the latent variables is as strong 
as possible, whereas the LPP method only needs to keep as much local structure 
information as possible when constructing its latent variables. In other words, 
although the local features of the process variables ($\theta_x = X^{T}S_xX$) 
and the quality variables ($\theta_y = Y^{T}S_yY$) are enhanced, the correlation 
between the local features is not enhanced. This direct combination of global and 
local features may therefore lead to erroneous results. 

In the GLPLS method, LPP is used to maintain the local structural features. 
Locally linear embedding (LLE) is also a commonly used manifold learning 
algorithm. Like the LPP algorithm, the LLE algorithm converts a global nonlinear 
problem into a combination of multiple local linear problems by maintaining local 
structural information, but the LLE algorithm has fewer adjustable parameters than 
the LPP algorithm. The LLE algorithm is therefore another good solution to the 
problem of a process system with strong local nonlinearity. The LLE algorithm has 
been briefly introduced in Chap. 11, and its optimization objective function is 
transformed into a general maximization form. In the next section, we combine the 
PLS method and the LLE/LPP method in a new way, trying to maintain the global 
and local structural information of the process variables and quality variables at the 
same time and to enhance the correlation between them. 


9.4 Basic Principles of GPLPLS 


9.4.1 The GPLPLS Model 


According to the Taylor series expansion, a nonlinear function can be written as 
follows: 

$$F(Z) = A(Z - Z_0) + g(Z - Z_0), \tag{9.15}$$ 

where $A(Z - Z_0)$ and $g(Z - Z_0)$ represent the linear part and the nonlinear 
part, respectively. In many real systems, especially near the balance point ($Z_0$), 
the linear part is primary and the nonlinear part is secondary. It is difficult for the 
PLS method to model nonlinear systems well because it uses the linear 
dimensionality reduction method PCA to obtain the principal components, which only 
establishes the relationship between the linear part of the input variable space ($X$) 
and the output variable space ($Y$). To obtain a better model with local nonlinear 
features, the KPLS model (Rosipal and Trejo 2001) maps the original data to a 
high-dimensional feature space, while the LPPLS model (Wang et al. 2017) transforms 
nonlinear features into a combination of multiple local linearized features. Both 
methods can solve some nonlinear problems. However, the feature space of the KPLS 
model is not easy to determine, and the main linear part in the LPPLS model would be 
better described directly by global structural features. 

In fact, the PLS optimization (9.5) includes two goals for the selected latent 
variables: one is that each latent variable carries as much variance as possible, 
and the other is that the correlation between the latent variables of the input space 
and the output space is as strong as possible. Although the GLPLS model combines 
global and local feature information, the combination of the two is not coordinated. 
How can the two kinds of features be combined so that they serve the same objective? 
According to the expression of a nonlinear function (9.15), the input and output 
spaces can both be divided into two parts: a linear part and a nonlinear part. By 
introducing local structure information, the nonlinear part can be transformed into a 
combination of multiple local linear problems. 

Inspired by the role of the PCA model ($w^{T}X^{T}Xw$) in the PLS model 
($w^{T}X^{T}Yc$) and by the limitation of the GLPLS algorithm, this section 
proposes a novel dimensionality reduction method. It combines global (PCA) and 
local (LLE/LPP) features to extract the latent variables of nonlinear systems. The 
input space $X$ and the output space $Y$ are mapped to the new feature spaces $X_F$ 
and $Y_F$, respectively. Each new feature space contains a global linear subspace and 
multiple local linear subspaces. The new feature spaces $X_F$ and $Y_F$ replace the 
original spaces $X$ and $Y$, respectively. Consequently, the objective function of 
the global plus local projection to latent structure (GPLPLS) method is given by the 
following new optimization objective: 

$$J_{GPLPLS}(w, c) = \arg\max\{w^{T}X_F^{T}Y_Fc\} 
\quad \text{s.t. } w^{T}w = 1,\ c^{T}c = 1, \tag{9.16}$$ 

where $X_F$ and $Y_F$ satisfy $X_F = X + \lambda_x\theta_x^{1/2}$ and 
$Y_F = Y + \lambda_y\theta_y^{1/2}$. 
The new feature spaces $X_F$ and $Y_F$ are both divided into a linear part 
($X$, $Y$) and a nonlinear part ($\lambda_x\theta_x^{1/2}$, 
$\lambda_y\theta_y^{1/2}$), similar to (9.15). Figure 9.2 shows the principle of the 
GPLPLS method. Here $X_{global}$ and $Y_{global}$ are the linear parts of the input 
space and the output space, respectively. They are projected to the reduced space 
by the traditional global projection method, PLS. $X_{local}$ and $Y_{local}$ are the 
corresponding nonlinear parts, which are projected to the reduced space by the 
locality-preserving projection method (LPP). 



Fig. 9.2 The schematic diagram of the GPLPLS method 

The core of extracting the principal components is PCA, so the linear model of 
$X$ and $Y$ is established by (9.16). It actually contains two relations: one is that 
the input and output spaces are each divided into “score” and “load” (external 
relationship), and the other is the relationship between the latent variables 
of the input space and the output space (internal relationship). These two 
relationships can also be seen in the schematic diagram (Fig. 9.2) of the GPLPLS 
model. Clearly, the local structure information can be retained in the internal model 
only, in the external model only, or in both at the same time. 
Therefore, by setting four different combinations of $\lambda_x$ and $\lambda_y$, 
four different optimization objective functions are obtained: 


(1) PLS optimization objective function: $\lambda_x = 0$, $\lambda_y = 0$. 

(2) GPLPLS$_x$ optimization objective function: $\lambda_x > 0$, $\lambda_y = 0$. 
(3) GPLPLS$_y$ optimization objective function: $\lambda_x = 0$, $\lambda_y > 0$. 
(4) GPLPLS$_{x+y}$ optimization objective function: $\lambda_x > 0$, $\lambda_y > 0$. 
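
The four cases differ only in which of the two spaces receives a local term. The sketch below builds $X_F$ and $Y_F$ for each case. It reads $\theta_x^{1/2}$ as $S_x^{1/2}X$, which is an interpretation suggested by (9.27) rather than a definition given here, and it uses a dense Gaussian kernel for $S_x$ and $S_y$ in place of the K-nearest-neighbor matrix so that a real matrix square root is guaranteed to exist. All function names and the default weights are hypothetical.

```python
import numpy as np

def gaussian_kernel(Z, delta=1.0):
    """Dense Gaussian kernel matrix (positive semidefinite stand-in for S)."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * delta ** 2))

def sym_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)            # remove round-off negatives
    return (vecs * np.sqrt(vals)) @ vecs.T

# Which local terms are active in each of the four models.
MODELS = {
    "PLS": (False, False),
    "GPLPLS_x": (True, False),
    "GPLPLS_y": (False, True),
    "GPLPLS_x+y": (True, True),
}

def feature_spaces(X, Y, Sx, Sy, model, lam_x=0.5, lam_y=0.5):
    """Return (X_F, Y_F) with X_F = X + lam_x * Sx^(1/2) X, likewise for Y."""
    use_x, use_y = MODELS[model]
    lx = lam_x if use_x else 0.0
    ly = lam_y if use_y else 0.0
    XF = X + lx * (sym_sqrt(Sx) @ X)
    YF = Y + ly * (sym_sqrt(Sy) @ Y)
    return XF, YF

# Hypothetical demo data.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
Y = rng.standard_normal((40, 2))
Sx, Sy = gaussian_kernel(X), gaussian_kernel(Y)
XF, YF = feature_spaces(X, Y, Sx, Sy, "GPLPLS_x+y")
```

With this reading, the added term satisfies $(S_x^{1/2}X)^{T}(S_x^{1/2}X) = X^{T}S_xX = \theta_x$, which is why it is written as a square root of $\theta_x$.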


9.4.2 Relationship Between GPLPLS Models 


The optimization objective function of the GPLPLS method is given by (9.16). There 
are three GPLPLS models according to the different values of $\lambda_x$ and 
$\lambda_y$. What is the relationship between the three GPLPLS models? How do they 
differ in their modeling? These issues are discussed in this section. 

Suppose the original relationship is $Y = f(X)$. Local linear embedding or 
locality-preserving projection can be regarded as the equilibrium point of system 
linearization. From this perspective, the models with different combinations of 
$\lambda_x$ and $\lambda_y$ are as follows: 


(1) PLS model: $\hat{Y} = A_0X$. 

(2) GPLPLS$_x$ model: $\hat{Y} = A_1[X, x_{l_i}]$. 

(3) GPLPLS$_y$ model: $\hat{Y} = A_2[X, f(x_{l_j})]$. 

(4) GPLPLS$_{x+y}$ model: $\hat{Y} = A_3[X, x_{l_i}, f(x_{l_j})]$. 

Here $x_{l_i}$ ($i = 1, 2, \ldots, k_x$) and $y_{l_j} = f(x_{l_j})$ 
($j = 1, 2, \ldots, k_y$) are the local feature points of the input space and output 
space, respectively. $A_0$, $A_1$, $A_2$, and $A_3$ are the model coefficient 
matrices. PLS uses a simple linear approximation of the original system. This 
approximation is generally poor for a strongly nonlinear system. GPLPLS uses spatial 
local decomposition and approximates the original system with the sum of multiple 
simple linear models. GPLPLS$_x$ or GPLPLS$_y$ is a special case of 
GPLPLS$_{x+y}$. It seems that these three combinations cover all possible GPLPLS 
models. Let us go back to the optimization function of the GPLPLS$_{x+y}$ model. 

$$\begin{aligned} 
J_{GPLPLS_{x+y}}(w, c) &= \arg\max\{w^{T}(X + \lambda_x\theta_x^{1/2})^{T}(Y + \lambda_y\theta_y^{1/2})c\} \\ 
&= \arg\max\{w^{T}X^{T}Yc + \lambda_x w^{T}(\theta_x^{1/2})^{T}Yc \\ 
&\qquad + \lambda_y w^{T}X^{T}\theta_y^{1/2}c + \lambda_x\lambda_y w^{T}(\theta_x^{1/2})^{T}\theta_y^{1/2}c\} \\ 
&\quad \text{s.t. } w^{T}w = 1,\ c^{T}c = 1. 
\end{aligned} \tag{9.17}$$ 


Obviously, (9.17) contains two coupled components ($(\theta_x^{1/2})^{T}Y$ and 
$X^{T}\theta_y^{1/2}$), which represent the correlation between the primary linear 
part and the nonlinear part. In some cases, these coupled components may have a 
negative impact on modeling. On the other hand, not only can the external 
relationship between the input and output space be extended to a combination of 
linear and nonlinear parts, but the internal relationship between the input and output 
space (the final model) can also be described as such a combination. It is therefore 
natural to model the linear and nonlinear parts without considering the coupling 
component between them. Correspondingly, the coupling component between the 
linear and nonlinear parts need not appear in the optimization function of the model. 
This gives the optimization objective of the following GPLPLS$_{xy}$ model: 

$$J_{GPLPLS_{xy}}(w, c) = \arg\max\{w^{T}X^{T}Yc + \lambda_{xy} w^{T}(\theta_x^{1/2})^{T}\theta_y^{1/2}c\} 
\quad \text{s.t. } w^{T}w = 1,\ c^{T}c = 1. \tag{9.18}$$ 

Here the parameter $\lambda_{xy}$ controls the trade-off between global and local 
features. 
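
The expansion in (9.17) and the terms that (9.18) drops can be checked numerically. The identity is purely algebraic, so the sketch below uses random matrices `Hx` and `Hy` as stand-ins for $\theta_x^{1/2}$ and $\theta_y^{1/2}$; these names, the data, and the parameter values are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, l = 30, 4, 2
X, Y = rng.standard_normal((n, m)), rng.standard_normal((n, l))
# Stand-ins for theta_x^(1/2) and theta_y^(1/2); any matrices of matching shape work.
Hx, Hy = rng.standard_normal((n, m)), rng.standard_normal((n, l))
lam_x, lam_y = 0.3, 0.7
lam_xy = lam_x * lam_y          # chosen so that (9.18) matches the last term of (9.17)

w = rng.standard_normal(m)
w /= np.linalg.norm(w)          # unit-norm weight vectors
c = rng.standard_normal(l)
c /= np.linalg.norm(c)

# Left-hand side of (9.17): objective in the mapped feature spaces.
full = w @ (X + lam_x * Hx).T @ (Y + lam_y * Hy) @ c

# The four terms of the expansion.
global_term = w @ X.T @ Y @ c
coupled = lam_x * (w @ Hx.T @ Y @ c) + lam_y * (w @ X.T @ Hy @ c)
local_term = w @ Hx.T @ Hy @ c

# Objective of (9.18): the global term plus the weighted local term only.
j_xy = global_term + lam_xy * local_term
```

For this choice of `lam_xy`, the difference `full - j_xy` equals `coupled`, that is, exactly the two cross terms between the linear and nonlinear parts.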


9.4.3 Principal Components of the GPLPLS Model 


This section describes how to obtain the principal components of the GPLPLS 
model. To facilitate comparison with the traditional linear PLS model, let 
$E_{0F} = X_F$ and $F_{0F} = Y_F$. The optimization objective functions 
of the four GPLPLS models are all included in the following optimization objective: 

$$J_{GPLPLS}(w, c) = \arg\max\{w^{T}X_F^{T}Y_Fc + \lambda_{xy} w^{T}(\theta_x^{1/2})^{T}\theta_y^{1/2}c\} 
\quad \text{s.t. } w^{T}w = 1,\ c^{T}c = 1, \tag{9.19}$$ 

where at least one of $[\lambda_x, \lambda_y]$ and $\lambda_{xy}$ is nonzero. The 
steps for obtaining the latent variables of the GPLPLS model (9.19) are as follows. 

First, Lagrangian multipliers are introduced to transform the objective 
function (9.19) into the following unconstrained form: 

$$\begin{aligned} 
\Psi(w_1, c_1) &= w_1^{T}E_{0F}^{T}F_{0F}c_1 + \lambda_{xy} w_1^{T}(\theta_x^{1/2})^{T}\theta_y^{1/2}c_1 \\ 
&\quad - \lambda_1(w_1^{T}w_1 - 1) - \lambda_2(c_1^{T}c_1 - 1). 
\end{aligned} \tag{9.20}$$ 

Setting $\partial\Psi/\partial w_1 = 0$ and $\partial\Psi/\partial c_1 = 0$ gives 
the optimal solution of $w_1$ and $c_1$. The objective function (9.19) is then 
transformed into 

$$\left[E_{0F}^{T}F_{0F} + \lambda_{xy}(\theta_x^{1/2})^{T}\theta_y^{1/2}\right] 
\left[E_{0F}^{T}F_{0F} + \lambda_{xy}(\theta_x^{1/2})^{T}\theta_y^{1/2}\right]^{T} w_1 = \theta^{2}w_1 \tag{9.21}$$ 

$$\left[F_{0F}^{T}E_{0F} + \lambda_{xy}(\theta_y^{1/2})^{T}\theta_x^{1/2}\right] 
\left[F_{0F}^{T}E_{0F} + \lambda_{xy}(\theta_y^{1/2})^{T}\theta_x^{1/2}\right]^{T} c_1 = \theta^{2}c_1, \tag{9.22}$$ 

where $\theta = w^{T}X_F^{T}Y_Fc + \lambda_{xy} w^{T}(\theta_x^{1/2})^{T}\theta_y^{1/2}c$. 
The target vectors $w_1$ and $c_1$ are calculated from (9.21) and (9.22). After 
obtaining the target vectors (that is, the direction vectors of the latent 
variables), the latent variables $t_1$ and $u_1$, the load vectors $p_1$ and $q_1$, 
and the residual matrices $E_{1F}$ and $F_{1F}$ can be calculated as follows: 

$$t_1 = E_{0F}w_1, \quad u_1 = F_{0F}c_1, \tag{9.23}$$ 

$$p_1 = \frac{E_{0F}^{T}t_1}{\|t_1\|^{2}}, \quad q_1 = \frac{F_{0F}^{T}t_1}{\|t_1\|^{2}}, \tag{9.24}$$ 

$$E_{1F} = E_{0F} - t_1p_1^{T}, \quad F_{1F} = F_{0F} - t_1q_1^{T}. \tag{9.25}$$ 

Similar to the PLS method, the other latent variables of the GPLPLS model can 
be obtained by continuing to decompose the residual matrices $E_{iF}$ and $F_{iF}$ 
($i = 1, 2, \ldots, d - 1$). Usually, the first $d$ latent variables are used to 
produce a better predictive regression model, and $d$ can be determined by the 
cross-validation test (Zhou et al. 2010). 
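
The extraction steps (9.21)-(9.25) can be sketched compactly. The code below obtains $w_i$ and $c_i$ from a singular value decomposition of the bracketed matrix in (9.21), which is equivalent to solving the two eigenvalue problems. It makes several assumptions that the text leaves open: the local cross term is kept fixed across deflation steps, $\theta^{1/2}$ is read as $S^{1/2}$ times the data (following (9.27)), and a dense Gaussian kernel replaces the K-nearest-neighbor matrix. The function names are hypothetical.

```python
import numpy as np

def local_half(Z, delta=1.0):
    """Assumed form of theta^(1/2): S^(1/2) Z with a dense Gaussian kernel S."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    vals, vecs = np.linalg.eigh(np.exp(-d2 / (2.0 * delta ** 2)))
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T @ Z

def gplpls_latent(E0F, F0F, Hx, Hy, lam_xy=0.0, d=2):
    """Extract d latent variables by repeated use of (9.21)-(9.25)."""
    E, F = E0F.copy(), F0F.copy()
    local = lam_xy * Hx.T @ Hy          # assumption: not deflated between steps
    W, C, T, P, Q = [], [], [], [], []
    for _ in range(d):
        M = E.T @ F + local
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        w, c = U[:, 0], Vt[0]           # M M^T w = s0^2 w and M^T M c = s0^2 c
        t = E @ w                       # score vector, cf. (9.23)
        p = E.T @ t / (t @ t)           # load vectors, cf. (9.24)
        q = F.T @ t / (t @ t)
        E = E - np.outer(t, p)          # deflation, cf. (9.25)
        F = F - np.outer(t, q)
        for store, v in zip((W, C, T, P, Q), (w, c, t, p, q)):
            store.append(v)
    W, C, T, P, Q = (np.column_stack(a) for a in (W, C, T, P, Q))
    return W, C, T, P, Q, E, F

# Hypothetical demo: GPLPLS_xy with E0F = X and F0F = Y (lambda_x = lambda_y = 0).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 4))
Y = X[:, :2] ** 2 + X @ rng.standard_normal((4, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
Hx, Hy = local_half(X), local_half(Y)
W, C, T, P, Q, E_res, F_res = gplpls_latent(X, Y, Hx, Hy, lam_xy=0.5, d=2)
```

Setting `lam_xy=0.0` and passing the original data reduces the routine to ordinary PLS weight extraction with deflation.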

The above describes how the GPLPLS model is established and how its principal 
components are extracted. The GPLPLS model is now compared with the GLPLS model. 

First of all, GPLPLS shares the main idea of the GLPLS method, i.e., combining 
local and global structural features (covariance). However, the GPLPLS method 
integrates the global and local structural features better than the GLPLS method. 
Unlike the GLPLS method, the GPLPLS method not only maintains the local 
structural features but also extracts as much relevant information as possible from 
the input space and output space. The GPLPLS method can therefore extract the 
largest possible global correlation while also extracting the local structural 
correlation between process and quality variables. 

In the LPPLS method (Chap. 10) and the LLEPLS method (Chap. 11), by contrast, 
all the characteristics are described by local features. This indiscriminate 
description has advantages in strongly nonlinear systems, but it does not necessarily 
have advantages in systems that are linearly dominant but locally nonlinear. 
The GPLPLS method proposed in this chapter targets linearly dominant processes 
while still maintaining some nonlinear relationships. It integrates global features 
(covariance) and nonlinear correlation (multivariance) as much as possible. 


9.5 GPLPLS-Based Quality Monitoring 


9.5.1 Process and Quality Monitoring Based on GPLPLS 


The GPLPLS-based monitoring method is very similar to the PLS method. The 
common monitoring indicators of PLS are $T^{2}$ and SPE. In Chap. 11, it has been 
explained in detail that the SPE statistic is not suitable for monitoring the residual 
space of PLS. Therefore, in this chapter, process monitoring based on the GPLPLS 
method uses $T^{2}$ statistics to monitor both the principal component subspace and 
the remaining subspace. The monitoring process is divided into two parts: offline 
training and online monitoring. The detailed process is as follows. 

The input space $X$ and the output space $Y$ of the GPLPLS model are mapped to
a low-dimensional space defined by a small number of latent variables
$[t_1, \ldots, t_d]$. The decomposition of $E_{0F}$ and $F_{0F}$ is as follows:

$$
\begin{aligned}
E_{0F} &= \sum_{i=1}^{d} t_i p_i^{T} + \bar{E}_{0F} = T P^{T} + \bar{E}_{0F} \\
F_{0F} &= \sum_{i=1}^{d} t_i q_i^{T} + \bar{F}_{0F} = T Q^{T} + \bar{F}_{0F},
\end{aligned}
\tag{9.26}
$$

where $T = [t_1, t_2, \ldots, t_d]$ is the score matrix. $P = [p_1, \ldots, p_d]$ and
$Q = [q_1, \ldots, q_d]$ are the loading matrices of the process variable $E_{0F}$ and
the quality variable $F_{0F}$, respectively. $T$ is expressed in terms of $E_{0F}$:

$$
T = E_{0F} R = \left(I + \lambda_x S_x^{1/2}\right) E_0 R, \tag{9.27}
$$

where $R = [r_1, \ldots, r_d]$ is the decomposition matrix, and

$$
r_i = \prod_{j=1}^{i-1} \left(I_m - w_j p_j^{T}\right) w_i .
$$
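The product formula for $r_i$ can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses plain linear PLS with a NIPALS-style deflation loop on synthetic data instead of the GPLPLS feature matrices, and the names `nipals_pls` and `weights_to_R` are hypothetical helpers, not part of the method. It builds $R$ column by column from the weights $w_j$ and loadings $p_j$ and compares the scores with $E_0 R$.

```python
import numpy as np

def nipals_pls(E0, F0, d):
    """Plain linear PLS with deflation; returns weights W, loadings P, scores T."""
    E, F = E0.copy(), F0.copy()
    W, P, T = [], [], []
    for _ in range(d):
        # Dominant left singular vector of E^T F, i.e. the top eigenvector
        # of E^T F F^T E.
        w = np.linalg.svd(E.T @ F)[0][:, 0]
        t = E @ w
        p = E.T @ t / (t @ t)
        q = F.T @ t / (t @ t)
        E = E - np.outer(t, p)   # deflate the input residual
        F = F - np.outer(t, q)   # deflate the output residual
        W.append(w)
        P.append(p)
        T.append(t)
    return np.column_stack(W), np.column_stack(P), np.column_stack(T)

def weights_to_R(W, P):
    """r_i = (I - w_1 p_1^T) ... (I - w_{i-1} p_{i-1}^T) w_i."""
    R = np.zeros_like(W)
    for i in range(W.shape[1]):
        r = W[:, i].copy()
        for j in reversed(range(i)):   # apply the factor closest to w_i first
            r = r - W[:, j] * (P[:, j] @ r)
        R[:, i] = r
    return R

# Synthetic, standardized data (assumed for illustration only).
rng = np.random.default_rng(0)
E0 = rng.standard_normal((200, 6))
F0 = E0[:, :2] @ rng.standard_normal((2, 2)) + 0.1 * rng.standard_normal((200, 2))
E0 = (E0 - E0.mean(0)) / E0.std(0, ddof=1)
F0 = (F0 - F0.mean(0)) / F0.std(0, ddof=1)

W, P, T = nipals_pls(E0, F0, d=3)
R = weights_to_R(W, P)
print(np.allclose(T, E0 @ R))   # scores computed directly from the undeflated data
```

Because $R$ acts on the undeflated data, a new sample can be scored with a single product $t = R^{T}x$, which is what the online stage relies on.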


Note that $E_{0F}$ contains the results of locality-preserving learning. Operations
(9.26) and (9.27) can be carried out during model training. During online monitoring,
however, the data are sampled in real time, and the transformation matrices $S_x$ and
$S_y$ used for locality learning cannot be constructed from an individual real-time
sample. For practical application, (9.26) and (9.27) are therefore transformed into
the decomposition of the normalized matrices $E_0$ and $F_0$,


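$$
E_0 = T_0 P^{T} + \bar{E}_0 \tag{9.28}
$$

$$
F_0 = T_0 \bar{Q}^{T} + \bar{F}_0 = E_0 R \bar{Q}^{T} + \bar{F}_0, \tag{9.29}
$$

where $T_0 = E_0 R$, $\bar{E}_0 = E_0 - T_0 P^{T}$, and $\bar{Q}^{T} = T_0^{+} F_0$.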


During online monitoring of new samples $x$ and $y$ (standardized data), an
oblique projection is introduced in the input space:

$$
x = \hat{x} + x_e \tag{9.30}
$$

$$
\hat{x} = P R^{T} x \tag{9.31}
$$

$$
x_e = \left(I - P R^{T}\right) x. \tag{9.32}
$$

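The statistics $T_{pc}^2$ and $T_e^2$ of the principal component space and the
remaining subspace are calculated as follows:

$$
t = R^{T} x \tag{9.33}
$$

$$
T_{pc}^2 := t^{T} \Lambda^{-1} t = t^{T} \left\{ \frac{1}{n-1} T_0^{T} T_0 \right\}^{-1} t \tag{9.34}
$$

$$
T_e^2 := x_e^{T} \Lambda_e^{-1} x_e = x_e^{T} \left\{ \frac{1}{n-1} X_e^{T} X_e \right\}^{-1} x_e, \tag{9.35}
$$

where $\Lambda$ and $\Lambda_e$ are the sample covariance matrices, and $X_e$ collects
the remaining-subspace vectors $x_e$ of the training samples. $T_{pc}^2$ and $T_e^2$
are compared with the thresholds $\mathrm{Th}_{pc,\alpha}$ and $\mathrm{Th}_{e,\alpha}$,
respectively. Because the statistics $T_{pc}^2$ and $T_e^2$ are not obtained through
the normalized data $E_0$, and the output variables may not follow a Gaussian
distribution, the corresponding thresholds cannot be calculated from the
F-distribution. Instead, their probability density functions are first estimated by
non-parametric kernel density estimation (KDE) (Lee et al. 2010).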


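The fault diagnosis logic based on the GPLPLS model is as follows:

$$
\begin{array}{ll}
T_{pc}^2 > \mathrm{Th}_{pc,\alpha} & \text{Quality-relevant faults} \\
T_{pc}^2 > \mathrm{Th}_{pc,\alpha}\ \text{or}\ T_e^2 > \mathrm{Th}_{e,\alpha} & \text{Process-relevant faults} \\
T_{pc}^2 < \mathrm{Th}_{pc,\alpha}\ \text{and}\ T_e^2 < \mathrm{Th}_{e,\alpha} & \text{Fault free}
\end{array}
\tag{9.36}
$$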


The process monitoring procedure of the GPLPLS algorithm with multiple-input and
multiple-output data is as follows:


(1) Standardize the original data $X$ and $Y$. Calculate $T_0$, $\bar{Q}$, and $R$
based on the GPLPLS algorithm, (9.28) and (9.29). Determine the number of principal
components $d$ by cross-validation.
(2) Construct the input remaining subspace $x_e$.
(3) Calculate the thresholds by non-parametric KDE, and perform the fault diagnosis
with the detection logic (9.36).
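The three steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the matrix `R` is assumed to be supplied by a trained model, the loading matrix `P` is recomputed by least squares so that $P^{T}R = I$ holds, a pseudo-inverse is used for $\Lambda_e$ because the remaining subspace has rank $m-d$, and the KDE uses a Gaussian kernel with a rule-of-thumb bandwidth. The function names and these choices are assumptions.

```python
import math
import numpy as np

def kde_threshold(samples, alpha=0.9975):
    """Upper alpha-quantile of a Gaussian KDE fitted to 1-D samples."""
    s = np.asarray(samples, dtype=float)
    h = 1.06 * s.std(ddof=1) * len(s) ** (-0.2)   # normal-reference bandwidth (assumed)

    def cdf(v):
        z = math.sqrt(2.0) * h
        return sum(0.5 * (1.0 + math.erf((v - si) / z)) for si in s) / len(s)

    lo, hi = s.min() - 10.0 * h, s.max() + 10.0 * h
    for _ in range(60):                           # bisection on the KDE cdf
        mid = 0.5 * (lo + hi)
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return hi

def fit_monitor(E0, R, alpha=0.9975):
    """Offline stage: covariances of T0 and X_e plus KDE thresholds."""
    n = E0.shape[0]
    T0 = E0 @ R
    P = E0.T @ T0 @ np.linalg.inv(T0.T @ T0)      # least-squares loadings, P^T R = I
    Xe = E0 - T0 @ P.T                            # training rows of the remaining subspace
    lam_inv = np.linalg.inv(T0.T @ T0 / (n - 1))
    lam_e_inv = np.linalg.pinv(Xe.T @ Xe / (n - 1), rcond=1e-10)
    t2_pc = np.einsum("ij,jk,ik->i", T0, lam_inv, T0)
    t2_e = np.einsum("ij,jk,ik->i", Xe, lam_e_inv, Xe)
    return {"R": R, "P": P, "lam_inv": lam_inv, "lam_e_inv": lam_e_inv,
            "th_pc": kde_threshold(t2_pc, alpha), "th_e": kde_threshold(t2_e, alpha)}

def diagnose(x, model):
    """Online stage for one standardized sample x, following logic (9.36)."""
    t = model["R"].T @ x                          # (9.33)
    xe = x - model["P"] @ t                       # (9.32)
    t2_pc = float(t @ model["lam_inv"] @ t)       # (9.34)
    t2_e = float(xe @ model["lam_e_inv"] @ xe)    # (9.35)
    if t2_pc > model["th_pc"]:
        return "quality-relevant fault"
    if t2_e > model["th_e"]:
        return "process-relevant fault"
    return "fault free"
```

A typical call would be `model = fit_monitor(E0, R)` on the standardized training data, followed by `diagnose(x, model)` for each new standardized sample.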


9.5.2 Posterior Monitoring and Evaluation 


Many quality-related process monitoring methods have been verified on the well-known
TE process simulation platform. The goal of most methods is to make the
quality-related alarm rate as high as possible, but the reasonableness of the
monitoring result seems to receive little attention. Therefore, similar to the
performance evaluation index of a control loop, we introduce a posterior monitoring
assessment (PMA) index to evaluate whether the quality-related alarm rate is
reasonable. PMA is defined as follows:

$$
\mathrm{PMA} = \frac{E\left(y_N^2\right)}{E\left(y_F^2\right)}, \tag{9.37}
$$

where $E(\cdot)$ is the mathematical expectation, and $y_N$ and $y_F$ are the output
data of the training data set and of the fault data set, respectively. Both are
normalized by the mean and standard deviation of $y_N$. $\mathrm{PMA} \to 1$
indicates that the quality of the fault data is close to normal operation;
$\mathrm{PMA} > 1$ indicates that the quality is better than normal. Moreover, a PMA
far from 1 means that the quality is very different from normal, so the corresponding
quality-related index $T^2$ (PLS method) or $T_{pc}^2$ (GPLPLS method) should be
higher, and the others should be lower.

However, the widely used controllers reduce the impact of certain faults, especially
small ones, so a single PMA indicator cannot truly reflect the dynamic changes. Two
PMA indicators are therefore adopted to describe the dynamic and steady-state effects,
respectively,

$$
\mathrm{PMA}_1 = \min_i \frac{E\left(Y_N^2(k_0 : k_1, i)\right)}{E\left(Y_F^2(k_0 : k_1, i)\right)} \tag{9.38}
$$

$$
\mathrm{PMA}_2 = \min_i \frac{E\left(Y_N^2(k_2 : n, i)\right)}{E\left(Y_F^2(k_2 : n, i)\right)}, \quad i = 1, 2, \ldots, l, \tag{9.39}
$$

where $k_j$, $j = 0, 1, 2$, are constants. Note that the worst case over the outputs
is selected in order to ensure the rationality of the evaluation. Moreover, the two
PMA indicators are only used to test whether the previous fault detection results are
reasonable. Their evaluation is objective, but, unlike the detection based on the
GPLPLS model, it does not indicate whether a fault is quality related. Quality testing
is necessary for further diagnosis.
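The two indicators can be computed directly from the recorded outputs. The sketch below is a plain-Python illustration under stated assumptions: the normal and fault records have the same length, the window bounds are 1-based and inclusive, and the function name `pma_indices` is hypothetical.

```python
from statistics import mean, stdev

def pma_indices(y_normal, y_fault, k0, k1, k2):
    """Return (PMA_1, PMA_2) for outputs given as lists of rows.

    Both records are normalized with the mean and standard deviation of the
    normal data; PMA_1 uses samples k0..k1 and PMA_2 uses samples k2..n.
    """
    n = len(y_fault)
    ratios_1, ratios_2 = [], []
    for i in range(len(y_normal[0])):
        col_n = [row[i] for row in y_normal]
        col_f = [row[i] for row in y_fault]
        mu, sd = mean(col_n), stdev(col_n)
        zn = [(v - mu) / sd for v in col_n]
        zf = [(v - mu) / sd for v in col_f]

        def ratio(a, b):
            num = mean([v * v for v in zn[a - 1:b]])
            den = mean([v * v for v in zf[a - 1:b]])
            return num / den

        ratios_1.append(ratio(k0, k1))
        ratios_2.append(ratio(k2, n))
    # The worst (smallest) ratio over the outputs is reported.
    return min(ratios_1), min(ratios_2)
```

With the window settings used later in this chapter, the call would be `pma_indices(y_normal, y_fault, 161, 960, 701)`.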

9.6 TE Process Simulation Analysis


Process monitoring and fault diagnosis based on the GPLPLS model are tested on
the TE simulation platform. The monitoring performance of several models, namely PLS,
concurrent projection to latent structures (CPLS) (Qin 2012), and GPLPLS, is compared.
In CPLS, the input and output spaces are projected and decomposed into five subspaces:
the input principal subspace, input residual subspace, output principal subspace,
output residual subspace, and joint input-output subspace. To focus on quality-related
faults, the principal and residual subspaces of the input are replaced by the input
remaining subspace $x_e$ in the CPLS model, and the corresponding monitoring statistics
are replaced by $T_e^2$. The output principal and residual subspaces of the CPLS model
are not considered, in order to highlight process-based quality monitoring. Two
different data sets are used, taken from Zhang et al. (2017) and Wang et al. (2017).


9.6.1 Model and Discussion 


The input matrix is composed of the process variables [XMEAS(1:22)] and the
manipulated variables [XMV(1:11), except XMV(5) and XMV(9)]. The output matrix is
composed of the quality variables [XMEAS(35), XMEAS(36)]. The training data are the
normal data IDV(0), and the test data are the 21 fault data sets IDV(1-21). The
threshold is calculated at a confidence level of 99.75% (see Eq. (1.10) for details).

The simulation parameters of the GPLPLS model (specifically the GPLPLS$_{xy}$ model)
are $k_x = 22$, $k_y = 23$, $\lambda_x = \lambda_y = 0$, $\lambda_{xy} = 1$, and $k_0 = 161$. Note that the local
nonlinear structure features are extracted by the LLE method. The numbers of
principal components of the PLS, CPLS, and GPLPLS models are 6, 6, and 2,
respectively, as determined by cross-validation. $k_1 = n = 960$ and $k_2 = 701$. The
detection results, including FDR, FAR, and the PMA indicators, are listed in Table 9.1.

With the two PMA indices in Table 9.1, the 21 faults are divided into two types:
quality-independent faults ($\mathrm{PMA}_1 > 0.9$ or
$\mathrm{PMA}_1 + \mathrm{PMA}_2 > 1.5$), including IDV(3, 4, 9, 11, 14, 15, 19), and
quality-related faults. The quality-related faults are further subdivided into four
types:


Type 1: the fault has a slight impact on quality [IDV(10, 16, 17, and 20)],
$0.5 < \mathrm{PMA}_i < 0.8$, $i = 1, 2$.


Type 2: the fault is quality recoverable [IDV(1, 5, and 7)], $\mathrm{PMA}_1 < 0.35$
and $\mathrm{PMA}_2 > 0.65$.


Type 3: the fault has a serious impact on quality [IDV(2, 6, 8, 12, 13, and 18)],
$\mathrm{PMA}_i < 0.1$, $i = 1, 2$.


Type 4: the fault causes the output variables to drift slowly [IDV(21)].
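
The numerical rules above can be written as a small decision function. This is only a sketch of the stated thresholds: the function name `classify_fault` and the fallback label are assumptions, and because no numerical rule is given for Type 4 (slow drift), such faults fall through to the fallback.

```python
def classify_fault(pma1, pma2):
    """Map the two PMA indices to the fault types listed above."""
    if pma1 > 0.9 or pma1 + pma2 > 1.5:
        return "quality-independent"
    if 0.5 < pma1 < 0.8 and 0.5 < pma2 < 0.8:
        return "type 1: slight impact on quality"
    if pma1 < 0.35 and pma2 > 0.65:
        return "type 2: quality recoverable"
    if pma1 < 0.1 and pma2 < 0.1:
        return "type 3: serious impact on quality"
    # No numerical rule is stated for the remaining cases (e.g. Type 4).
    return "unclassified by the PMA rules"

# PMA values of IDV(19), IDV(20), and IDV(21) as listed in Table 9.1
print(classify_fault(0.9453, 0.8859))   # quality-independent
print(classify_fault(0.6700, 0.7366))   # type 1: slight impact on quality
print(classify_fault(0.2342, 0.1063))   # unclassified by the PMA rules
```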


Table 9.1 FDRs of PLS, CPLS, GPLPLS$_{xy}$, and PMA

IDV PLS CPLS GPLPLS,y PMA 

eA T T2 T? T2; T: PMA; |PMA2 
1 0.6930 
2 0.0580 
3 0.8670 
4 0.9277 
5 0.9461 
6 0.0026 
7 0.9721 
8 0.0951 
9 0.8465 
10 0.5064 
11 0.6956 
12 0.0232 
13 0.0208 
14 0.8580 
15 0.5710 
16 0.5355 
17 0.6862 
18 . : 0.0037 
19 0.50 41.13 0.00 39.00 0.00 36.13 | 0.9453 | 0.8859 
20 30.50 90.75 | 20.13 88.25 | 12.50 90.25 |0.6700 | 0.7366 
21 41.88 47.63 | 37.25 45.75 | 21.25 50.75 | 0.2342 | 0.1063 


This classification is only a preliminary result that depends on the choice of the
parameters $k_0$, $k_1$, and $k_2$, but it still has reference value. All methods give
consistent results for the serious quality-related faults, so these are not discussed
in the following fault detection analysis.


9.6.2 Fault Diagnosis Analysis


From the above results, the detection results of some faults are not consistent across
the different methods; these include quality-recoverable faults, slight
quality-related faults, and quality-independent faults. The detailed analysis of the
three situations is given below. In all monitoring graphs, the horizontal axis
represents the sample and the vertical axis represents the statistic (the upper plot
shows the quality-related statistic and the lower plot shows $T_e^2$); the red dotted
line is the threshold at a confidence level of 99.75%, and the blue line is the actual
monitoring value. In all prediction graphs, the horizontal axis represents the sample
and the vertical axis represents the output value; the blue dashed line is the actual
value, and the green line is the prediction.

Fig. 9.3 Output prediction for IDV(1), IDV(5), and IDV(7) using the GPLPLS$_{xy}$ method: (a) IDV(1), (b) IDV(5), (c) IDV(7)

(1) Quality-recoverable fault 

Quality-recoverable faults include IDV(1), IDV(5), and IDV(7). They are all
step-change faults, but the feedback or cascade controllers can reduce their effect on
quality during the actual process. Therefore, the quality variables under faults
IDV(1), IDV(5), and IDV(7) should return to normal. The output prediction is shown in
Fig. 9.3. As an example, the fault monitoring results for IDV(7) obtained with the
PLS, CPLS, and GPLPLS$_{xy}$ models are shown in Fig. 9.4. Here both statistics detect
the process-related fault in the input space. For the GPLPLS$_{xy}$ model, the value of
the $T_{pc}^2$ statistic returns to normal, while the $T_e^2$ statistic still maintains
a high value. This means that these faults are quality-recoverable faults. PLS and
CPLS report these faults as quality-related faults but give many false alarms,
especially for IDV(7): their quality-related statistics come very close to the
threshold but still exceed it, so they keep raising the fault alarm even after the
operation has returned to normal under the controller. They fail to grasp the essence
of the fault detection problem when quality is recoverable. In this case, the
GPLPLS$_{xy}$ method accurately reflects the process and quality changes.

Fig. 9.4 PLS, CPLS, and GPLPLS$_{xy}$ monitoring results for IDV(7)

Fig. 9.5 Output prediction for IDV(4), IDV(11), and IDV(14) using the GPLPLS$_{xy}$ method

(2) Quality-independent fault 

Quality-independent faults include IDV(4), IDV(11), and IDV(14); they are, however,
related to the process. All these faults are related to the reactor cooling water, and
these disturbances hardly affect the quality of the output products. The corresponding
output quality prediction of the GPLPLS$_{xy}$ method is shown in Fig. 9.5. The
monitoring results for IDV(14) by the PLS, CPLS, and GPLPLS$_{xy}$ methods are shown in
Fig. 9.6. In the GPLPLS$_{xy}$ model, $T_{pc}^2$ stays almost entirely under the
threshold, which indicates that these faults are not related to quality. For the PLS
and CPLS models, however, these faults are detected by both statistics. In other
words, the PLS and CPLS models indicate that these disturbances are related to
quality. Compared with PLS, the CPLS method can filter out fault alarms to a certain
extent in its quality-related statistic, but it still has a higher alarm rate than
GPLPLS$_{xy}$. For quality-independent faults, PLS and CPLS have a high detection rate
but fail to identify the faults as quality independent.

(3) Slight quality-related faults 

Faults such as IDV(10), IDV(16), IDV(17), and IDV(20) have a slight impact on quality.
This type of fault has received little study. Their quality-related alarm rates are
similar to those of quality-recoverable faults. Although they are quality related,
they have little impact on quality, and the values of their quality-related monitoring
statistic $T_{pc}^2$ are relatively small. To some extent, these faults can also be
regarded as faults that have nothing to do with quality. Many methods, such as the PLS
method, fail to detect them accurately. The output predictions of the GPLPLS$_{xy}$
model are shown in Fig. 9.7. The monitoring results of the three models for fault
IDV(20) are shown in Fig. 9.8. The monitoring results of the GPLPLS$_{xy}$ model are
the most accurate, whereas the PLS and CPLS models give false alarms. In the
GPLPLS$_{xy}$ model, process changes better match quality changes.

Fig. 9.6 PLS, CPLS, and GPLPLS$_{xy}$ monitoring results for IDV(14)

Fig. 9.7 Output predicted values for IDV(16), IDV(17), and IDV(20) using the GPLPLS$_{xy}$ method

Fig. 9.8 PLS, CPLS, and GPLPLS$_{xy}$ monitoring results for IDV(20): (a) PLS, (b) CPLS, (c) GPLPLS$_{xy}$

From the three situations analyzed above, it can be seen that the GPLPLS method can
filter out nuisance alarms. It can handle minor quality-related faults,
quality-unrelated faults, and quality-recoverable faults. There are two possible
reasons for the good fault diagnosis performance of the GPLPLS method. First, the
principal components of the GPLPLS method are based on global features combined with
nonlinear local structural features, which enhances the nonlinear mapping ability of
the method. Second, the GPLPLS method uses a non-Gaussian threshold, which makes it
possible to process signals that do not necessarily satisfy the Gaussian assumption.


9.6.3 Comparison of Different GPLPLS Models


For the same data set as above, the FDRs of the other three models, GPLPLS$_x$,
GPLPLS$_y$, and GPLPLS$_{x+y}$ (local nonlinear structural features are all extracted
by the LLE method), are shown in Table 9.2, where $K = [k_x, k_y]$. The table shows
that all of these methods perform well and lead to consistent conclusions. In
particular, the FDRs of the GPLPLS$_{x+y}$ and GPLPLS$_{xy}$ models are very similar.

To discuss these models more clearly, fault IDV(7) is selected for further analysis.
Table 9.2 shows that the monitoring results of IDV(7) by the GPLPLS$_y$ model are
clearly inconsistent with those of the other methods: its $T_{pc}^2$ statistic gives a
higher alarm rate (79.25%). According to the previous analysis, this alarm is a
nuisance false alarm. The other three models have relatively low alarm rates for fault
IDV(7), near 26%, which means that their monitoring effect is very good. A possible
reason for the false alarm is that the GPLPLS$_y$ model only enhances the local
nonlinear structure characteristics in the output space. It assumes that the input
space is linear and that only the output space


Table 9.2 FDRs of GPLPLS methods with LLE local feature


IDV   GPLPLS_x          GPLPLS_y          GPLPLS_{x+y}      GPLPLS_{xy}
      k_x = 16          k_y = 16          K = [22, 24]      K = [22, 23]
      T_pc^2  T_e^2     T_pc^2  T_e^2     T_pc^2  T_e^2     T_pc^2  T_e^2

1 99.75 
2 98.38 
3 1.38 
4 100.00 
5 100.00 
6 100.00 
7 100.00 
8 97.88 
9 1.25 
10 84.50 
11 76.75 
12 99.88 
13 95.25 
14 100.00 
15 1.63 
16 43.75 
17 96.88 
18 86.38 90.00 | 88.88 90.00 | 87.00 90.00 | 87.00 89.88 
19 0.00 38.25 | 0.00 38.50 | 0.00 37.75 | 0.00 37.38 
20 8.63 90.63 | 22.50 89.75 | 12.50 90.50 | 12.50 90.38 
21 14.00 52.75 | 31.63 44.25 | 21.25 49.63 | 21.25 50.25 




is nonlinear. In that situation, its process monitoring results may be better.
However, the input space of the TE simulation process may also have strong
nonlinearity, which leads to the poor monitoring results of the GPLPLS$_y$ model,
whereas the other three models show higher consistency for this type of fault.

The above results of the GPLPLS models are obtained by combining them with the LLE
method to retain local nonlinear structural features. The monitoring results of the
GPLPLS models combined with another locality-preserving algorithm, the LPP method, are
shown in Table 9.3, where $\Sigma = [\sigma_x, \sigma_y]$. Table 9.3 leads to
consistent conclusions, so the analysis is not repeated here.

Many methods, such as GLPLS and LPPLS, share the idea of fusing global projection with
local preservation. These methods all require parameter tuning, and different
parameters give different results. To be as consistent as possible with the existing
results of other methods, we chose the same data set as in Wang et al. (2017) for the
following tests.


Table 9.3 FDRs of GPLPLS methods with LPP local feature


IDV   GPLPLS_x          GPLPLS_y          GPLPLS_{x+y}      GPLPLS_{xy}
                                          K = [22, 24]      K = [22, 23]
                                          Σ = [2, 1]        Σ = [0.05, 1.3]
      T_pc^2  T_e^2     T_pc^2  T_e^2     T_pc^2  T_e^2     T_pc^2  T_e^2

1 99.75 
2 98.38 
3 1.25 
4 100.00 
5 100.00 
6 100.00 
7 100.00 
8 97.88 
9 1.25 
10 84.25 
11 77.00 
12 99.88 
13 95.25 
14 100.00 
15 1.88 
16 43.00 
17 97.00 
18 89.88 
19 0.00 36.75 0.13 39.00 0.00 37.13 0.00 37.13 
20 10.25 90.38 |25.13 89.38 |11.75 90.38 |11.25 90.38 
21 20.38 49.50 |29.50 43.63 | 20.88 49.00 | 20.88 49.38 

In the following comparison experiment, the input variable matrix $X$ is composed of
the process variables [XMEAS(1:22)] and 11 manipulated variables [XMEAS(23:33)],
except XMV(12). The quality variable matrix $Y$ includes XMEAS(35) and XMEAS(38). The
parameters of the models that combine a manifold learning algorithm with PLS are set
as follows:

(1) The GLPLS model: $\delta_x = 0.1$, $\delta_y = 0.8$, $k_x = 12$, $k_y = 12$.

(2) The LPPLS model: $\delta_x = 1.5$, $\delta_y = 0.8$, $k_x = 20$, $k_y = 15$.

(3) The GPLPLS model: $k_x = 11$, $k_y = 16$ (mainly refers to the GPLPLS$_{xy}$
model).

Table 9.4 lists the FDR values of different quality-related monitoring methods, namely
the PLS, CPLS, GLPLS, and GPLPLS models; the corresponding detection thresholds are
calculated at a confidence level of 99.75%. The last two columns are the PMA values
calculated for this data set.

Tables 9.1 and 9.4 show that, although the data sets are different, the PMA results
are similar. Therefore, the quality-related monitoring results should also be similar,
and the GPLPLS model indeed gives consistent conclusions. The FDRs of the other models
are higher than those of GPLPLS because these models do not distinguish well

Table 9.4 FDRs comparison for different quality-related methods 


IDV PLS CPLS GLPLS GPLPLS PMA1 PMA2 
1 99.13 96.13 99.75 66.75 0.20 0.68 
2 98.00 81.25 97.63 92.75 0.07 0.06 
3 0.38 0.50 1.13 0.50 0.77 1.19 
4 0.63 0.13 98.88 0.25 0.89 1.02 
5 21.88 20.38 21.38 17.63 0.30 1.04 
6 99.25 99.25 99.38 96.38 0.00 0.00 
7 36.75 35.63 83.63 27.75 0.14 1.03 
8 92.50 87.75 93.38 74.88 0.06 0.07 
9 0.63 0.38 0.75 0.00 0.90 0.81 
10 30.00 28.00 23.13 13.88 0.59 0.81 
11 1.38 0.25 53.50 0.38 0.78 0.76 
12 87.50 84.75 87.75 75.50 0.04 0.03 
13 93.88 85.00 95.25 79.75 0.02 0.02 
14 33.50 1.63 96.88 0.00 1.07 0.77 
15 0.63 0.75 1.50 0.50 0.90 0.57 
16 14.25 12.63 9.00 8.00 0.78 0.53 
17 56.00 37.13 96.75 1.63 0.64 0.70 
18 88.00 88.00 90.25 86.75 0.01 0.00 
19 0.00 0.00 2.50 0.00 0.95 0.75 
20 26.63 27.75 36.25 10.25 0.67 0.78 
21 29.88 24.50 44.38 8.63 0.23 0.09 
whether these faults are quality related. Although GLPLS has a similar idea of fusing
global features and local structure, its weak monitoring performance is caused by
inappropriate parameters and model construction. Because suitable parameters are
difficult to select, the parameter determination method remains an open issue.

In summary, the GPLPLS model shows good monitoring performance. It suitably combines
global and local structural features, so its output prediction and fault monitoring
results are better than those of the other models.


9.7 Conclusions 


This chapter proposes a new statistical monitoring model based on the global plus
local projection to latent structures (GPLPLS) model. This model not only maintains
the global and local structural characteristics of the data but also pays more
attention to the correlation between the extracted principal components. First, the
GLPLS method is introduced, and it is pointed out that its model construction is
unreasonable; the GPLPLS method is then proposed to maintain the global and local
features with a new structure. Next, a monitoring model based on the GPLPLS method is
established, and its monitoring performance is verified on the TE process simulation
platform. The results show that, compared with PLS, CPLS, and GLPLS, the GPLPLS method
has better process monitoring performance for quality-related faults.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 


Chapter 10
Locality-Preserving Partial Least Squares Regression


This chapter proposes another nonlinear PLS method, named locality-preserving partial
least squares (LPPLS), which embeds the nonlinear dimensionality-reduction and
structure-preserving properties of LPP into the PLS model. The core of LPPLS is to
replace the role of PCA in PLS with LPP. When extracting the principal components
$t_i$ and $u_i$, two conditions must be satisfied: (1) $t_i$ and $u_i$ retain the most
information about the local nonlinear structure of their respective data sets; (2) the
correlation between $t_i$ and $u_i$ is the largest. Finally, a quality-related
monitoring strategy is established based on LPPLS.

First, the geometric interpretation of PCA in PLS and of LPP is introduced. The LPPLS
model and the LPPLS-based quality-related process monitoring method are then proposed.
Three different types of LPPLS models are given in the same framework, addressing
three nonlinear cases: nonlinear correlation within the input space $X$, within the
output space $Y$, and between them. A typical algorithm for extracting principal
components is derived. Finally, the feasibility and effectiveness of the LPPLS method
are verified on artificial 3-D data and Tennessee Eastman process simulations.


10.1 The Relationship Among PCA, PLS, and LPP 


For the normalized data sets of process variables
$X = [x^{T}(1), x^{T}(2), \ldots, x^{T}(n)]^{T} \in R^{n \times m}$ ($x \in R^{1 \times m}$)
and quality variables
$Y = [y^{T}(1), y^{T}(2), \ldots, y^{T}(n)]^{T} \in R^{n \times l}$ ($y \in R^{1 \times l}$),
where $m$ and $l$ are the dimensions of the process and quality variable spaces and
$n$ is the number of samples, the principal component extraction of PCA, LPP, and PLS
is equivalent to the following constrained optimization problems.
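
$$
J_{\mathrm{PCA}}(w) = \max\, w^{T} X^{T} X w \quad \text{s.t. } w^{T} w = 1 \tag{10.1}
$$

$$
J_{\mathrm{LPP}}(w) = \max\, w^{T} X^{T} S_x X w \quad \text{s.t. } w^{T} X^{T} D_x X w = 1 \tag{10.2}
$$

$$
J_{\mathrm{PLS}}(w, c) = \max\, w^{T} X^{T} Y c \quad \text{s.t. } w^{T} w = 1,\ c^{T} c = 1 \tag{10.3}
$$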


The meaning of the related variables, such as $w$ and $c$, has been given in Chap. 9.
Also in Chap. 9, to mitigate PLS's lack of local feature extraction capability, the
input space $X$ and the output space $Y$ are mapped into new feature spaces $X_F$ and
$Y_F$ that include a global linear subspace and a plurality of local linear subspaces.
Consequently, the optimization objective function of the global plus local projection
to latent structures (GPLPLS) method is obtained by using the feature space $X_F$ or
$Y_F$ to replace the original space $X$ or $Y$,

$$
\begin{aligned}
J_{\mathrm{GPLPLS}}(w, c) &= \arg\max \{ w^{T} X_F^{T} Y_F c \} \\
\text{s.t. } & w^{T} w = 1,\ c^{T} c = 1,
\end{aligned}
\tag{10.4}
$$

where $X_F = X + \lambda_x \theta_x^{1/2}$ and $Y_F = Y + \lambda_y \theta_y^{1/2}$.

Although adding local features to the global features gives the GPLPLS model excellent
fault detection performance, the GPLPLS model does not fully implement local feature
extraction; its local features are only extracted approximately. The main reason is
that the constraint condition of the GPLPLS model is still that of PCA or PLS. This
way of combining the two generally cannot guarantee the constraints of PCA and LPP at
the same time.

In Chap. 9, only the nonlinear part of the function is described by local features,
and the linear part is still characterized by the traditional covariance matrix. In
fact, the characteristics of the linear part can also be described by local features.
In this way, the linear and nonlinear parts can be regarded as a whole, thereby
avoiding unnecessary parameter trade-offs. In the following, we analyze the
differences and similarities between PCA and LPP.

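The local characteristics of $X$ in LPP are contained in the matrices $X^{T} S_x X$
and $X^{T} D_x X$. To compare LPP with PCA, the matrices $S_x$ and $D_x$ are decomposed
into $S_x^{\frac{1}{2}T} S_x^{\frac{1}{2}}$ and $D_x^{\frac{1}{2}T} D_x^{\frac{1}{2}}$,
respectively. Then the LPP criterion (10.2) is further transformed as

$$
\begin{aligned}
J_{\mathrm{LPP}}(w) &= \max\, w^{T} X_M^{T} X_M w \\
\text{s.t. } & w^{T} M_x^{T} M_x w = 1,
\end{aligned}
\tag{10.5}
$$

where $M_x = D_x^{\frac{1}{2}} X$ and $X_M = S_x^{\frac{1}{2}} X$.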
Comparing (10.5) and (10.1), it can be seen that the optimization problems of LPP and
PCA have a similar mathematical structure. “PCA selects a subspace consisting of the
eigenvectors corresponding to the largest eigenvalues of the global covariance matrix,
while LPP selects a subspace consisting of the eigenvectors corresponding to the
smallest eigenvalues of the local covariance matrix (He et al. 2005)”. Therefore, LPP
can replace PCA in the PLS decomposition process, thus achieving the preservation of
strong local nonlinearity.

In the PLS criterion (10.3) for forming latent variables, PCA is used to extract a set
of components that transforms the original data $X$ into a set of t-scores $T$. PCA
and PLS only extract global linear features and therefore do not reflect the local
information of the samples or their nonlinear features. However, PCA is not the only
method of extracting principal components. LPP, which converts the global nonlinearity
into a combination of multiple local linearities, can also be used for extracting
principal components. Therefore, LPP is suitable for systems with strong local
nonlinear features.


10.2 LPPLS Models and LPPLS-Based Fault Detection 


10.2.1 The LPPLS Models 


Based on (10.3), the two criteria for selecting the latent vectors $u_i$ and $t_i$ for
PLS are as follows:


(1) The linear variation on the latent vectors is manifested as much as possible;
(2) The correlation between the latent vectors is as strong as possible.


The optimization objective for extracting the first component pair $(t_1, u_1)$ is

$$
\begin{aligned}
J_{\mathrm{PLS}}(w_1, c_1) &= \max\, w_1^{T} X^{T} Y c_1 \\
\text{s.t. } & w_1^{T} w_1 = 1,\ c_1^{T} c_1 = 1.
\end{aligned}
\tag{10.6}
$$


The optimization objective (10.6) is used for fast extraction of principal components
in PLS. Define $E_0 = X$ and $F_0 = Y$; then the latent variables $t_1$ and $u_1$ are
calculated by $t_1 = E_0 w_1$ and $u_1 = F_0 c_1$, where $w_1$ and $c_1$ are the
eigenvectors corresponding to the maximum eigenvalues of the following matrices:

$$
E_0^{T} F_0 F_0^{T} E_0 w_1 = \theta_1^2 w_1 \tag{10.7}
$$

$$
F_0^{T} E_0 E_0^{T} F_0 c_1 = \theta_1^2 c_1. \tag{10.8}
$$
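
The eigenproblems (10.7) and (10.8) can be solved with a symmetric eigendecomposition. The following is a minimal sketch on arbitrary data; the function name `pls_first_pair` is hypothetical. It returns the first weight pair, the first latent pair, and $\theta_1$, the value of the objective (10.6).

```python
import numpy as np

def pls_first_pair(E0, F0):
    """First PLS weights (w1, c1), latent vectors (t1, u1), and theta_1."""
    M = E0.T @ F0 @ F0.T @ E0            # symmetric positive semidefinite, m x m
    vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    w1 = vecs[:, -1]                     # unit eigenvector of the largest eigenvalue
    c1 = F0.T @ E0 @ w1
    c1 = c1 / np.linalg.norm(c1)         # unit-norm top eigenvector of (10.8)
    theta1 = float(np.sqrt(vals[-1]))
    return w1, c1, E0 @ w1, F0 @ c1, theta1
```

Since $c_1$ is obtained from $w_1$, only one eigendecomposition is needed; $\theta_1$ equals the largest singular value of $E_0^{T} F_0$.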


Considering the similarity between LPP and PCA discussed in the previous section, LPP
is used instead of PCA to extract the principal components in the PLS decomposition
(10.3), which yields the LPPLS model. Three LPPLS models (types I, II, and III) are
developed to address the different nonlinear relationships.

The type I LPPLS model deals with the case where the input space $X$ has a nonlinear
relationship and the correlation between the input $X$ and the output $Y$ is linear.
The principal components of the input space $X$ are extracted by LPP, and those of the
output space $Y$ are extracted by PCA. The optimization objective is as follows:

$$
\begin{aligned}
J_{\mathrm{LPPLS_{I}}}(w, c) &= \max\, w^{T} X_M^{T} Y c \\
\text{s.t. } & c^{T} c = 1,\ w^{T} M_x^{T} M_x w = 1.
\end{aligned}
\tag{10.9}
$$

The type II LPPLS model deals with the case where the correlation between the input
space $X$ and the output space $Y$ is nonlinear but the correlation within the input
space $X$ is linear. The principal components of the input space $X$ are extracted by
PCA, and those of the output space $Y$ are extracted by LPP. The optimization function
is

$$
\begin{aligned}
J_{\mathrm{LPPLS_{II}}}(w, c) &= \max\, w^{T} X^{T} Y_M c \\
\text{s.t. } & w^{T} w = 1,\ c^{T} M_y^{T} M_y c = 1,
\end{aligned}
\tag{10.10}
$$

in which

$$
Y_M = S_y^{\frac{1}{2}} Y, \quad S_y = S_y^{\frac{1}{2}T} S_y^{\frac{1}{2}}
$$

$$
M_y = D_y^{\frac{1}{2}} Y, \quad D_y = D_y^{\frac{1}{2}T} D_y^{\frac{1}{2}},
$$

where $S_y$ and $D_y$ are defined similarly to $S_x$ and $D_x$, with a different
neighbor parameter $\delta_y$ in (9.8).

The type III LPPLS model is given for the case of nonlinear correlation between the
input space $X$ and the output space $Y$ as well as within the input space $X$. In
this case, the principal components of the input space $X$ and the output space $Y$
are both extracted by LPP. The corresponding optimization objective function is

$$
\begin{aligned}
J_{\mathrm{LPPLS_{III}}}(w, c) &= \max\, w^{T} X_M^{T} Y_M c \\
\text{s.t. } & w^{T} M_x^{T} M_x w = 1,\ c^{T} M_y^{T} M_y c = 1.
\end{aligned}
\tag{10.11}
$$

The criteria for the selection of latent vectors u; and t; for type III LPPLS are as 
follows: 


(1) The nonlinear variation on the latent vector is manifested as much as possible; 
(2) The correlation between latent vectors is as strong as possible. 


Discussion. One of the aims is to choose factors $u_i$ and $t_i$ that better represent
the nonlinear variation. For comparison, the optimization objective of GLPLS is given
in (10.12) (Zhong et al. 2016):

$$
\begin{aligned}
J_{\mathrm{GLPLS}}(w, c) &= \max \left\{ w^{T} X^{T} Y c + \beta_1 w^{T} X^{T} S_x X w + \beta_2 c^{T} Y^{T} S_y Y c \right\} \\
\text{s.t. } & w^{T} w = 1,\ c^{T} c = 1,
\end{aligned}
\tag{10.12}
$$

where the parameters $\beta_1$ and $\beta_2$ control the trade-off between global and
local feature extraction. Here the embedding properties and data screening of LPP are
lost because the LPP constraints $w^{T} X^{T} D_x X w = 1$ and
$c^{T} Y^{T} D_y Y c = 1$ are removed in (10.12). The GLPLS model is a fusion of the
PLS model with a partial LPP model. “The best vectors w and c from (10.12) ensure
maximum correlation (PLS) and relative or local optimal data filtering and embedding
capabilities for X and Y (Zhong et al. 2016)”. On the other hand,
$w^{T} X^{T} S_x X w$ and $c^{T} Y^{T} S_y Y c$ only introduce the local features
within the input and output spaces, not the correlation features between them. In
contrast, the LPP model is fully embedded in the LPPLS model: it is embedded in the
outer layer, the inner layer, or both layers of the PLS model, which gives the three
types of LPPLS models. At the same time, the correlation information in the input and
output spaces is retained.

Type III LPPLS is used as an example to show the extraction of principal components.
Suppose the first component pair is $(t_1, u_1)$. Define $E_{0L} = X_M$ and
$F_{0L} = Y_M$ to facilitate comparison with the traditional linear PLS.

First, the optimization (10.11) for the first component pair $(t_1, u_1)$ is converted
into an unconstrained problem by Lagrangian multipliers,

$$
\Psi(w_1, c_1) = w_1^{T} E_{0L}^{T} F_{0L} c_1 - \lambda_1 \left( w_1^{T} M_x^{T} M_x w_1 - 1 \right) - \lambda_2 \left( c_1^{T} M_y^{T} M_y c_1 - 1 \right). \tag{10.13}
$$

Let $\partial \Psi / \partial w_1 = 0$ and $\partial \Psi / \partial c_1 = 0$; then
the optimal pair $w_1$ and $c_1$ satisfies

$$
E_{0L}^{T} F_{0L} c_1 = 2 \lambda_1 M_x^{T} M_x w_1 \tag{10.14}
$$

$$
F_{0L}^{T} E_{0L} w_1 = 2 \lambda_2 M_y^{T} M_y c_1. \tag{10.15}
$$

Multiplying (10.14) and (10.15) on the left by $w_1^{T}$ and $c_1^{T}$, respectively,
gives

$$
\theta_1 := 2 \lambda_1 = 2 \lambda_2 = w_1^{T} E_{0L}^{T} F_{0L} c_1 = c_1^{T} F_{0L}^{T} E_{0L} w_1. \tag{10.16}
$$

Comparing (10.11) and (10.16), it is found that $\theta_1$ is the objective function
value. Substituting (10.16) into (10.14) and (10.15) gives the relationship between
$w_1$ and $c_1$,

$$
w_1 = \frac{1}{\theta_1} \left( M_x^{T} M_x \right)^{-1} E_{0L}^{T} F_{0L} c_1 \tag{10.17}
$$

$$
c_1 = \frac{1}{\theta_1} \left( M_y^{T} M_y \right)^{-1} F_{0L}^{T} E_{0L} w_1. \tag{10.18}
$$

Substituting (10.18) into (10.14) and (10.17) into (10.15), the following equations
for the first vector pair are obtained,

$$
\left( M_x^{T} M_x \right)^{-1} E_{0L}^{T} F_{0L} \left( M_y^{T} M_y \right)^{-1} F_{0L}^{T} E_{0L} w_1 = \theta_1^2 w_1 \tag{10.19}
$$

$$
\left( M_y^{T} M_y \right)^{-1} F_{0L}^{T} E_{0L} \left( M_x^{T} M_x \right)^{-1} E_{0L}^{T} F_{0L} c_1 = \theta_1^2 c_1. \tag{10.20}
$$

The optimal weight vectors $w_1$ and $c_1$ are the eigenvectors corresponding to the
maximum eigenvalue in (10.19) and (10.20). The latent variables $u_1$ and $t_1$ are
then calculated as follows:

$$
t_1 = E_{0L} w_1, \quad u_1 = F_{0L} c_1.
$$
Calculation of the load vector: 


T T 
_ Forti _ Forti 


= 4:5 - 
llt |? llt |? 


Residual matrixes Ez and Fy; are 
— T _ -T 
Eiz = Eor — tipi, Fit = For — uqi. 


The first optimal weight vector w; of PLS (10.7) is the eigenvectors of matrix 
E\F oF JEo, while in LPPLS (10.19), it is corresponding to the eigenvectors of 
matrix (MTM,)' E}, Fou (NTN) Fl, Eo. The optimization problem with 
maximum eigenvalue in (10.19)are very similar to the traditional linear PLS. There- 
fore, the traditional NIPALS technique is convenient to extract the remaining prin- 
ciple components. 

The other latent variables are calculated from the residual matrices $E_{iL}$ and
$F_{iL}$, $i = 1, 2, \ldots, d-1$,

$$t_{i+1} = E_{iL} w_{i+1}, \quad u_{i+1} = F_{iL} c_{i+1},$$

where $w_{i+1}$ is the eigenvector corresponding to the maximum eigenvalue $\theta_{i+1}^2$ of the
matrix $(M_x^T M_x)^{-1} E_{iL}^T F_{iL} (N_y^T N_y)^{-1} F_{iL}^T E_{iL}$.

Similarly, $c_{i+1}$ is the eigenvector corresponding to the maximum eigenvalue
of $(N_y^T N_y)^{-1} F_{iL}^T E_{iL} (M_x^T M_x)^{-1} E_{iL}^T F_{iL}$. Then,

$$p_{i+1} = \frac{E_{iL}^T t_{i+1}}{\|t_{i+1}\|^2}, \quad q_{i+1} = \frac{F_{iL}^T t_{i+1}}{\|t_{i+1}\|^2}.$$

Finally, the number $d$ of latent variables of LPPLS is determined using the cross-validation
method.
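
The extraction procedure of this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lppls_components` is hypothetical, `Mx` and `Ny` stand for the constraint matrices $M_x$ and $N_y$ of (10.13) and are assumed to have full column rank, and the residual matrices are assumed to be updated at every step in the same way as $E_{1L}$ and $F_{1L}$.

```python
import numpy as np

def lppls_components(E0L, F0L, Mx, Ny, d):
    """Extract d component pairs via the eigenproblem (10.19) and deflation.

    E0L (n x m), F0L (n x l): locality-mapped input and output data.
    Mx, Ny: constraint matrices of (10.13); assumed full column rank.
    """
    E, F = E0L.astype(float), F0L.astype(float)   # working copies
    Gx = np.linalg.inv(Mx.T @ Mx)
    Gy = np.linalg.inv(Ny.T @ Ny)
    W, C, T, P, Q, theta = [], [], [], [], [], []
    for _ in range(d):
        # (10.19): the top eigenvector is w, its eigenvalue is theta^2.
        vals, vecs = np.linalg.eig(Gx @ E.T @ F @ Gy @ F.T @ E)
        k = int(np.argmax(vals.real))
        w = vecs[:, k].real
        w = w / np.sqrt(w @ Mx.T @ Mx @ w)        # w^T Mx^T Mx w = 1
        c = Gy @ F.T @ E @ w                       # direction given by (10.18)
        c = c / np.sqrt(c @ Ny.T @ Ny @ c)        # c^T Ny^T Ny c = 1
        t = E @ w                                  # latent variable t
        p = E.T @ t / (t @ t)                      # load vectors
        q = F.T @ t / (t @ t)
        E = E - np.outer(t, p)                     # residual matrices
        F = F - np.outer(t, q)
        for store, item in zip((W, C, T, P, Q), (w, c, t, p, q)):
            store.append(item)
        theta.append(np.sqrt(max(vals[k].real, 0.0)))
    return tuple(np.column_stack(a) for a in (W, C, T, P, Q)) + (np.array(theta),)
```

With this normalization the stored value `theta[0]` equals $w_1^T E_{0L}^T F_{0L} c_1$, in agreement with (10.16), and the extracted score vectors are mutually orthogonal because each deflation removes the part of the data explained by the current score.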
10.2.2 LPPLS for Process and Quality Monitoring


$X$ and $Y$ are projected to a low-dimensional space by the latent variables $(t_1, \ldots, t_d)$.
The neighboring mappings $E_{0L}$ and $F_{0L}$ of the original data are decomposed as follows:

$$
\begin{aligned}
E_{0L} &= \sum_{i=1}^{d} t_i p_i^T + \bar{E} = T P^T + \bar{E} \\
F_{0L} &= \sum_{i=1}^{d} t_i q_i^T + \bar{F} = T \bar{Q}^T + \bar{F},
\end{aligned} \tag{10.21}
$$

where $T = [t_1, t_2, \ldots, t_d]$ collects the latent score vectors, and $P = [p_1, \ldots, p_d]$ and
$\bar{Q} = [q_1, \ldots, q_d]$ are the load matrices for $E_{0L}$ and $F_{0L}$, respectively. $T$ is represented by
the neighboring mapping data $E_{0L}$,

$$T = E_{0L} R = S_x^{1/2} E_0 R, \tag{10.22}$$

where $R = [r_1, \ldots, r_d]$,

$$r_i = \prod_{j=1}^{i-1} (I - w_j p_j^T)\, w_i.$$

As with the GPLPLS method, (10.21) and (10.22) are difficult to apply in practice,
since the locality transformation matrix $S_x^{1/2}$ cannot be obtained for online
measurements. So they are changed to the direct decomposition of $E_0$ and $F_0$,

$$E_0 = S_x^{-1/2} (T P^T + \bar{E}) = T_0 P^T + E' \tag{10.23}$$
$$F_0 = S_y^{-1/2} (S_x^{1/2} T_0 \bar{Q}^T + \bar{F}), \tag{10.24}$$

where $T_0 = E_0 R$ and $E' = S_x^{-1/2} \bar{E}$.

Process and quality monitoring for new scaled and mean-centered data samples
$x$ and $y$ is performed by the oblique projection of the input data $x$,

$$
\begin{aligned}
x &= \hat{x} + x_e \\
\hat{x} &= P R^T x \\
x_e &= (I - P R^T) x.
\end{aligned} \tag{10.25}
$$


The residual space still contains much variation information (Qin and Zheng 2012),
but it is not the main focus of LPPLS. To facilitate the comparison with traditional
monitoring methods, this chapter directly adopts the traditional fault monitoring
indices without any modification. The $T^2$ and $Q$ statistics are defined as

$$
\begin{aligned}
t &= R^T x \\
T^2 &= t^T \Lambda^{-1} t = t^T \left( \frac{1}{n-1} T_0^T T_0 \right)^{-1} t \\
Q &= \|x_e\|^2 = x^T (I - P R^T)^T (I - P R^T) x,
\end{aligned} \tag{10.26}
$$

where $\Lambda$ is the sample covariance matrix. The matrix $X_M$ (or $E_{0L}$) of Type I LPPLS
is not scaled and mean-centered. Moreover, in nonlinear systems the output
variables may not obey the Gaussian distribution even if the input variables obey
it. So the control limits of the $T^2$ and $Q$ statistics are not computed according to
the $F$ and $\chi^2$ distributions. They are calculated instead from their probability density
functions obtained by the non-parametric kernel density estimation method (Lee et al.
2004).
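
The statistics in (10.26) and a kernel-density control limit can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the Gaussian kernel with Silverman's rule-of-thumb bandwidth is an assumption, since the text does not specify the kernel or the bandwidth.

```python
import numpy as np

def t2_q_statistics(X, R, P, Lambda_inv):
    """T^2 and Q of (10.26) for every row x of X (samples stored as rows)."""
    T = X @ R                                   # t = R^T x for each sample
    t2 = np.einsum("ij,jk,ik->i", T, Lambda_inv, T)
    residual = X - T @ P.T                      # x_e = (I - P R^T) x
    q = np.sum(residual ** 2, axis=1)
    return t2, q

def kde_control_limit(values, alpha=0.99, grid_size=2000):
    """Control limit as the alpha-quantile of a Gaussian kernel density estimate.

    Assumption: Silverman's rule of thumb is used for the bandwidth h.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    h = 1.06 * values.std(ddof=1) * n ** (-1.0 / 5.0)
    grid = np.linspace(values.min() - 3 * h, values.max() + 3 * h, grid_size)
    z = (grid[:, None] - values[None, :]) / h
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    cdf = np.cumsum(density) * (grid[1] - grid[0])
    cdf = cdf / cdf[-1]                         # normalize the numerical CDF
    return grid[np.searchsorted(cdf, alpha)]
```

In use, `Lambda_inv` would be the inverse of $\frac{1}{n-1} T_0^T T_0$ computed from the training scores, and the limits would be estimated once from the training values of $T^2$ and $Q$ and then applied to new samples.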


Remark 10.1 The LPPLS decomposition (10.23) is similar to that of linear PLS, but its
residual space $E'$ is related to the locality-preserving projection matrix $S_x^{1/2}$. It is difficult
to obtain the locality-preserving projection matrix $S_x^{1/2}$ for new data during online fault
detection. However, the sample covariance matrix $\Lambda$ and the statistics $T^2$ and $Q$
in (10.26) are not directly related to the locality-preserving projection matrix $S_x^{1/2}$, which is
a useful feature for online monitoring.


Although the matrix $S_{xy} := S_y^{-1/2} S_x^{1/2} \in R^{n \times n}$ is constant, the regression equation
(10.24) cannot be used for output projections. As mentioned above, the first reason
is that the locality-preserving projection matrices $S_x^{1/2}$ and $S_y^{1/2}$ are difficult
to obtain for new data. The second is that directly applying a least-squares solution
for these matrices may lead to poor prediction performance. The prediction performance
directly determines whether a model needs to be updated in practice. The
regression equation can instead be constructed from $F_0$ and $T_0$, based on (10.23),

$$F_0 = T_0 Q^T + F. \tag{10.27}$$


Remark 10.2 In the special case of $S_{xy} = I$, (10.24) and (10.27) are equal. In most
cases, the regression coefficients ($\bar{Q}$ and $Q$) are significantly different. However,
since both $\bar{Q}$ and $Q$ are least-squares solutions of their respective regression equations,
the regression errors $\bar{F}$ and $F$ are equivalent in theory. Therefore, the latter regression
equation (10.27) can be used to predict the output corresponding to new
input data.
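
A least-squares fit of (10.27) and the resulting prediction can be sketched as follows. The function names are hypothetical, and the sketch assumes that samples are stored as rows and that $R$ has already been obtained from the LPPLS model.

```python
import numpy as np

def fit_output_regression(E0, F0, R):
    """Least-squares estimate of Q in F0 = T0 Q^T + F, with T0 = E0 R."""
    T0 = E0 @ R
    Qt, *_ = np.linalg.lstsq(T0, F0, rcond=None)   # solves T0 @ Qt ~ F0
    return Qt.T                                     # Q has shape (l x d)

def predict_output(x_new, R, Q):
    """Predicted output for new scaled input rows: y_hat = x R Q^T."""
    return np.atleast_2d(x_new) @ R @ Q.T
```

Only $R$ and $Q$ are needed at prediction time, so the locality matrices that are unavailable online do not appear.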
[Figure panel: Original dataset]

Fig. 10.1 Projection results of the PLS and LPPLS models for the S-curve data set with $Y = 2x_1 - x_3$.
The Type I LPPLS model is used


[Figure panels: PLS, LPPLS]

Fig. 10.2 Projection results of the PLS and LPPLS models for the Swiss roll data set with $Y = x_1 x_3$.
The Type III LPPLS model is used


10.2.3 Locality-Preserving Capacity Analysis 


Here two three-dimensional artificial data sets, the S-curve and the Swiss roll, are used to
illustrate the locality-preserving capacity of LPPLS. They are commonly used to validate
the performance of manifold learning algorithms.

$$
\begin{aligned}
X_1 &= [x_1; x_2; x_3] = \big[\,[\cos\alpha, -\cos\alpha];\ 5v_1;\ [\sin\alpha, 2 - \sin\alpha]\,\big] \\
X_2 &= [x_1; x_2; x_3] = \big[\,t\cos t;\ 2v_3;\ t\sin t\,\big],
\end{aligned}
$$

where $\alpha = (1.5v_2 - 1)\pi$ and $t = \frac{3\pi}{2}(1 + 2v_4)$; $v_1$, $v_2$, $v_3$ and $v_4$ are uniformly
distributed on (0, 1). Two kinds of output function are defined: $y = 2x_1 - x_3$ (linear)
and $y = x_1 x_3$ (nonlinear).

A total of 1000 sample points are randomly generated in the 3-D space $[x_1, x_2, x_3]$, and
dimensionality reduction is performed with the PLS and LPPLS models. The
projection results of the two models in two dimensions are shown in Figs. 10.1 and
10.2, respectively.
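
The two data sets and their outputs can be generated with a short NumPy sketch. The function names are hypothetical, and the scale of the second Swiss-roll coordinate (`height_scale`) follows the value read from the text and is exposed as a parameter because it does not affect the roll structure.

```python
import numpy as np

def s_curve(n, rng):
    """S-curve samples (rows): two half-curves built from the same angles."""
    alpha = np.pi * (1.5 * rng.uniform(size=n // 2) - 1.0)
    x1 = np.concatenate([np.cos(alpha), -np.cos(alpha)])
    x2 = 5.0 * rng.uniform(size=x1.size)
    x3 = np.concatenate([np.sin(alpha), 2.0 - np.sin(alpha)])
    return np.column_stack([x1, x2, x3])

def swiss_roll(n, rng, height_scale=2.0):
    """Swiss-roll samples (rows); height_scale is an assumed parameter."""
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.uniform(size=n))
    x2 = height_scale * rng.uniform(size=n)
    return np.column_stack([t * np.cos(t), x2, t * np.sin(t)])

def outputs(X):
    """Linear and nonlinear outputs used in this section."""
    y_linear = 2.0 * X[:, 0] - X[:, 2]
    y_nonlinear = X[:, 0] * X[:, 2]
    return y_linear, y_nonlinear
```

For example, `s_curve(1000, np.random.default_rng(0))` returns a 1000 x 3 array whose rows are the sample points.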

The projection results show that PLS does not preserve the local structural information
of the S-curve and Swiss roll data sets. In other words, the data are not correctly
classified by color. LPPLS, however, preserves the local structural features and gives
good classification results. The LPPLS model improves the locality-preserving capability of
the PLS model; moreover, LPPLS can better discriminate the boundary features. Thus, the
LPPLS method can be used to detect faults related to output variables in systems
with strong nonlinearity.


10.3 Case Study 


The proposed LPPLS-based fault detection method is validated on the
Tennessee Eastman Process (TEP) simulation platform (Lyman and Georgakis 1995). The TEP
is described in detail in Lee et al. (2006). The related data sets
are downloaded from “http://web.mit.edu/braatzgroup/links.html”. PCA (Dunia and
Qin 1998; Good et al. 2010) and other global-local preserving projection methods
(Luo 2014; Bao et al. 2016; Luo et al. 2016) do not incorporate any information from the
output space, so only the LPPLS method and two quality-related monitoring methods
(the PLS method and the GLPLS method) are compared.


10.3.1 PLS, GLPLS and LPPLS Models 
The input variable matrix $X = [x_1, x_2, \ldots, x_{33}]^T$ consists of 22 process variables
(XMEAS(1:22) := $x_1 : x_{22}$) and 11 manipulated variables ($x_{23} : x_{33}$), i.e., all except XMV(12).
The quality variable matrix $Y = [y_1; y_2]$ is composed of component G of stream
9 and component E of stream 11, i.e., XMEAS(35) ($y_1$) and XMEAS(38) ($y_2$). The
training set is the normal data IDV(0), containing 960 samples. The test set is the
fault data IDV(1:21). Each fault data set has 960 samples (the first 160 samples are
normal and the last 800 samples are faulty). The model parameters are $\delta_x = 1.5$,
$\delta_y = 0.8$, $K_x = 20$ and $K_y = 15$, where $K_x$ and $K_y$ are the neighborhood parameters
in the input space and output space, respectively. The regression coefficients obtained
by the PLS, GLPLS, and LPPLS models are shown in Table 10.1. The relative errors
of training are shown in Fig. 10.3. Here the relative error is calculated as error =
$(y_i - y_{i,tr})/y_i$, $i = 1, 2$, where $y_{i,tr}$ is the corresponding output of the trained model.
The training errors in Fig. 10.3 show that the training results of the PLS, GLPLS,
and LPPLS models satisfy the modeling requirements. The output prediction experiments
for these models are carried out under all the fault conditions (i.e., the test data
set), and similar prediction abilities are obtained in most cases. Taking fault IDV(21)
as an example, the output predictions of the three models are shown in Fig. 10.4, with $y_1$ and
$y_2$ at the top and bottom of the figure, respectively. Fault IDV(21) causes
the output variables to drift slowly (Lee et al. 2006), but the prediction
performance of the three methods is still good even in this fault case. So the
generalization capability of the three models is verified.
Table 10.1 Regression coefficients of PLS, GLPLS, and LPPLS models

             PLS       PLS     GLPLS     GLPLS     LPPLS     LPPLS
              y1        y2        y1        y2        y1        y2
         18.1489   -0.7162   13.6677   -3.0777  212.8754  -77.0014
x1       -0.0593    0.0855    0.0387    0.0392   -0.0496    0.0932
x2        0.0000    0.0000    0.0001    0.0000   -0.0001    0.0000
x3       -0.0001    0.0000    0.0000    0.0000   -0.0001    0.0000
x4        0.0261   -0.0149    0.1011   -0.0058    0.0271   -0.0182
x5       -0.0055    0.0015    0.0058    0.0046    0.0000    0.0030
x6        0.0003   -0.0009    0.0041    0.0000   -0.0056   -0.0007
x7       -0.0009    0.0000   -0.0009    0.0003   -0.0002   -0.0003
x8       -0.0013    0.0000   -0.0125    0.0003   -0.0061   -0.0003
x9       -0.0656    0.0229   -0.1396   -0.0028   -0.1016    0.0447
x10      -0.0946    0.0128   -0.0293    0.0440   -0.4048    0.0257
x11       0.0223   -0.0027    0.0240   -0.0007    0.0296    0.0000
x12      -0.0009    0.0002   -0.0008   -0.0008   -4.0733    1.5519
x13      -0.0009    0.0000   -0.0005    0.0002    0.0003   -0.0001
x14       0.0005    0.0002    0.0018   -0.0001    0.0000    0.0001
x15       0.0007   -0.0004   -0.0004    0.0001   -0.8701    0.5530
x16      -0.0011    0.0001   -0.0009    0.0004   -0.0031    0.0002
x17       0.0007    0.0000    0.0016   -0.0001   -0.2341    0.0077
x18       0.0101   -0.0051   -0.0220    0.0039   -0.0251   -0.0167
x19       0.0001   -0.0001    0.0005    0.0000    0.0001   -0.0001
x20      -0.0001   -0.0020    0.0076   -0.0025   -0.0005   -0.0012
x21       0.0145    0.0035    0.0949    0.0218   -0.0094    0.0074
x22       0.0044   -0.0036    0.0152    0.0026   -0.0033   -0.0054
x23      -0.0043    0.0008    0.0017    0.0069   -0.0047    0.0010
x24      -0.0040   -0.0019    0.0106   -0.0030   -0.0044   -0.0024
x25      -0.0006    0.0009   -0.0001    0.0005    0.0001    0.0005
x26      -0.0003   -0.0002    0.0000    0.0008    0.0006   -0.0003
x27      -0.0053   -0.0039   -0.0095   -0.0027   -0.0146   -0.0042
x28      -0.0007    0.0003   -0.0034   -0.0003    0.0011    0.0002
x29      -0.0003    0.0001   -0.0003   -0.0003    1.3836   -0.5273
x30       0.0003   -0.0002   -0.0002    0.0000    0.3753   -0.2391
x31       0.0004   -0.0005   -0.0004   -0.0001    0.0054    0.0007
x32      -0.0017   -0.0004    0.0046   -0.0017    0.0022   -0.0010
x33      -0.0007    0.0002    0.0001   -0.0001   -0.0990    0.0037

Fig. 10.3 Relative errors of PLS, GLPLS, and LPPLS models

Fig. 10.4 Prediction results for IDV(21) of PLS, GLPLS, and LPPLS models


10.3.2 Quality Monitoring Analysis 


The $T^2$ statistic represents the mapping between process variables and quality variables
for PLS and its related methods. An alarm in the $T^2$ statistic indicates a quality-related
fault. In contrast, the $Q$ statistic represents only the residuals in the input
space; therefore, its alarm indicates that the fault is not quality related. Table 10.2
gives the fault detection rates (FDR) of the monitoring statistics, whose control limits are
calculated at a confidence level of 99.75%.
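
As a small illustration of how such rates can be computed from a monitoring statistic, the sketch below assumes the usual definition of the FDR as the percentage of faulty samples whose statistic exceeds the control limit, and uses the data layout of this case study (the first 160 samples normal, the remaining samples faulty). The function name and the additional false alarm rate are illustrative choices, not taken from the text.

```python
import numpy as np

def detection_rates(stat, limit, fault_start=160):
    """Return (FDR, FAR) in percent for one monitoring statistic.

    stat: values of T^2 or Q for one test run, in time order.
    limit: control limit of the statistic.
    fault_start: index of the first faulty sample (160 in this case study).
    """
    stat = np.asarray(stat, dtype=float)
    alarms = stat > limit
    far = 100.0 * alarms[:fault_start].mean()   # alarms on normal samples
    fdr = 100.0 * alarms[fault_start:].mean()   # alarms on faulty samples
    return fdr, far
```

For a run in which half of the 800 faulty samples exceed the limit and no normal sample does, the function returns an FDR of 50 and a false alarm rate of 0.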

The product quality consists of component G (XMEAS(35)) and component E
(XMEAS(38)). Faults IDV(3,4,9,11,14,15,19) have almost no effect on product quality,
but the remaining faults cause significant changes in the quality variables. The
FDR results of the LPPLS method match this actual TEP behavior: LPPLS detects
quality-related faults with much higher accuracy than the PLS and GLPLS models
(e.g., IDV(5) and IDV(12) in Table 10.2). In this section, the fault detection
performance is further examined on three fault scenarios: a disturbance
in the reactor cooling water, a disturbance in the condenser cooling water, and a constant
position of the stream 4 valve.

Experiment 1: Disturbance in Reactor Cooling Water (Quality-Independent
Fault)

The faults related to the reactor cooling water are IDV(4), IDV(11), and IDV(14).
As mentioned above, they have little effect on the product quality but are process
related. The results of monitoring the variation of the reactor cooling water are
shown in Fig. 10.5. Here IDV(14) is taken as the example in order to compare with
other quality-related methods, such as GPLPLS given in Chap. 9.
Table 10.2 FDR (%) of PLS, GLPLS, and LPPLS models

Fault        PLS T^2      PLS Q  GLPLS T^2    GLPLS Q  LPPLS T^2
IDV(1)         99.13      99.38      99.75      99.38      98.63
IDV(2)         98.00      98.25      97.63      98.25      98.13
IDV(3)          0.38       0.13       1.13       0.63       0.50
IDV(4)          0.63      86.00      98.88      67.00       0.25
IDV(5)         21.88      16.00      21.38      22.25      99.63
IDV(6)         99.25     100.00      99.38     100.00     100.00
IDV(7)         36.75     100.00      83.63     100.00      37.63
IDV(8)         92.50      94.00      93.38      97.13      92.25
IDV(9)          0.63       0.50       0.75       0.88       0.63
IDV(10)        30.00       4.38      23.13      26.63      49.00
IDV(11)         1.38      57.88      53.50      52.50       2.88
IDV(12)        87.50      91.00      87.75      97.88      95.50
IDV(13)        93.88      93.00      95.25      94.25      94.13
IDV(14)        33.50     100.00      96.88      99.88       2.50
IDV(15)         0.63       0.38       1.50       0.88       0.75
IDV(16)        14.25       3.13       9.00      12.50      53.38
IDV(17)        56.00      85.38      96.75      85.25      52.75
IDV(18)        88.00      89.25      90.25      89.88      87.88
IDV(19)         0.00       4.13       2.50       1.63       3.25
IDV(20)        26.63      34.00      36.25      35.38      28.13
IDV(21)        29.88      39.75      44.38      33.75      42.38
Fig. 10.5 PLS, GLPLS, and LPPLS monitoring for IDV(14)

The faults related to the reactor cooling water cause variation of the reactor
temperature, but the reactor temperature is controlled by a cascade controller. So these
disturbances, including the step fault IDV(4), the random fault IDV(11), and the valve-sticking
fault IDV(14), do not affect the product quality. Table 10.2 shows the fault
detection rates of the PLS, GLPLS, and LPPLS methods. The $Q$ statistics of all
three methods detect these process-related faults in the input space with higher FDR.
The FDR values of the $T^2$ statistic of LPPLS are much smaller than those of the other methods,
which indicates that these faults are quality-independent. Fault IDV(14) is a special
case. When traditional analysis methods, such as filtering or PLS, are applied to
this fault, most of the information about the fault feature is lost, which makes the fault
difficult to detect in the input space. Consider the detection results for fault IDV(14). The FDRs of
the $T^2$ statistic for the PLS and GLPLS models are 33.5% and 96.88%, far higher than that of
LPPLS. This means that PLS and GLPLS classify fault IDV(14) as quality related.
The FDR of the $T^2$ statistic of LPPLS is 2.5%, close to that of GPLPLS (Tables 9.2 and
9.3). So LPPLS can effectively filter out quality-irrelevant faults, similarly to the GPLPLS
method.
Experiment 2: Disturbance in Condenser Cooling Water (Quality-Related
Fault)
These faults include the quality-related faults IDV(5) and IDV(12). Fault IDV(5)
is caused by a step change in the cooling water flow rate of the condenser. Since
the series controller compensates for this step change, the separator temperature
returns to its setpoint. PLS and GLPLS give similar results, returning to
the setpoint 10 h after the fault. But LPPLS-based monitoring provides a persistent
alarm in the $T^2$ statistic (Fig. 10.6). “The persistence of the fault detection statistic is
demonstrated by the fact that it continues to alert the operator to process anomalies
even though all process variables appear to have returned to their normal values,
especially important in quality-related process fault detection (Lee et al. 2006)”.
In fact, a disturbance in the condenser cooling water, such as in its flow rate, always
affects the output quality. It should be pointed out that the cooling water flow rate of the
condenser plays an important role both in the output quality and in the safety of the
chemical plant. This fault cannot be eliminated by the series controller and should
raise an alarm. Although the controller can compensate for the variations caused by this
fault, the process-related monitoring in the $Q$ statistic (Fig. 10.6) provides a consistent
alarm. The experimental results show that the PLS and GLPLS models do not actually
capture the source of the fault, while LPPLS does.
Experiment 3: Constant Position in Valve of Stream 4
Fault IDV(21), which causes a slow output drift, has been little studied. The sensitivity of
fault detection is related to the magnitude of the mass drift. Therefore, fast detection
of fault IDV(21) is beneficial for quality control. The process monitoring results
are shown in Fig. 10.7. For GLPLS, LPPLS, and PLS, this fault is fully detected
as quality-related after about 650, 720, and 780 samples, respectively. LPPLS and
GLPLS thus detect fault IDV(21) faster than the PLS method.

Fig. 10.6 PLS, GLPLS, and LPPLS monitoring for IDV(5)

Fig. 10.7 PLS, GLPLS, and LPPLS monitoring for IDV(21)

The following conclusions are drawn from the above experiments.

(1) PLS is a linear model, so it cannot accurately identify some faults in strongly
nonlinear systems.

(2) GLPLS and LPPLS extract nonlinear correlation better by introducing
the locality-preserving ability of the LPP strategy.

(3) GLPLS aims at preserving the local features in the input space and output space,
but lacks the correlation between them. GLPLS is actually a linear PLS plus partial
locality preserving, in which the role of LPP is not fully reflected. This may lead
to false detection or missed detection of faults.

(4) LPPLS makes full use of the LPP algorithm to achieve local nonlinear structure
preservation. It decomposes the global nonlinear problem into a combination of
multiple local linear problems by introducing local structure information. Therefore,
LPPLS establishes a more effective model than GLPLS for the nonlinear correlation
between the input space and the output space.


10.4 Conclusions 


In this chapter, the LPPLS statistical model is proposed and LPPLS-based quality-related
fault detection and prediction are presented. LPPLS not only retains the local
information of the original data, but also maintains the correlation between $X$ and
$Y$ to the maximum extent, thus achieving accurate prediction of quality variables.
LPPLS achieves excellent detection performance for locally nonlinear
systems, owing to its local feature extraction ability controlled by two parameters, $\delta_x$
and $\delta_y$. Experimental results on the artificial three-dimensional data sets, S-curve and
Swiss roll, show that LPPLS maintains local structural features well. The experimental
results on the TEP simulator show that LPPLS extracts the local nonlinear features more
effectively and has better fault detection performance than the PLS and GLPLS models.
Chapter 11
Locally Linear Embedding Orthogonal
Projection to Latent Structure


Quality variables are measured much less frequently, and usually with a significant
time delay, compared with process variables. Monitoring
process variables and their associated quality variables is an essential undertaking, as
abnormal conditions can lead to potential hazards that may cause system shutdowns and thus
possibly huge economic losses. The maximum correlation between quality variables and
process variables is extracted by partial least squares (PLS) analysis (Kruger et al. 2001; Song
et al. 2004; Li et al. 2010; Hu et al. 2013; Zhang et al. 2015). In order to deal with the
nonlinear correlation of industrial data, this chapter proposes two further nonlinear
PLS methods. The first, Local Linear Embedded Projection of Latent Structure
(LLEPLS), is an oblique projection on the input data space. By further
decomposing the LLEPLS model, Local Linear Embedded Orthogonal Projection
of Latent Structure (LLEOPLS) is proposed, in which an orthogonal projection on the
input space is obtained. Both LLEPLS and LLEOPLS extract the maximum relevant
information and preserve the local nonlinear structure between input and output
simultaneously.

From the viewpoint of statistical analysis, LLEPLS and LLEOPLS project the input and
output spaces into three subspaces: (1) the joint input-output subspace, which aims at finding
the nonlinear relationship between the input and output and can also be used for quality
prediction; (2) the output-residual subspace, which aims at monitoring the quality-related
faults that cannot be predicted from the process data; and (3) the orthogonal input-residual
subspace, which aims at identifying whether a predictable fault is quality related.
The corresponding monitoring strategies are established based on the LLEPLS and
LLEOPLS models, respectively.
Fig. 11.1 Outer- and inner-model presentation for PLS decomposition


11.1 Comparison of GPLPLS, LPPLS, and LLEPLS


PLS performs better than PCA on quality-relevant faults. As shown
in Fig. 11.1, the output space ($Y$) and input space ($X$) are decomposed in the PLS
model. Here the external relationship is the “foundation” and the internal relationship
is the “result”. For nonlinear PLS, the desired “results” cannot be obtained by adjusting the
internal structure (Zhang and Qin 2008) if the external relationships are linear.
Therefore, it is possible to build better internal relationships by starting with the
analysis of the external relationships. A nonlinear function is usually approximated
by a series of locally weighted linear models. For example, Wang et al. (2014) and Yin
et al. (2016, 2017) use locally weighted projection regression (LWPR) or a few
univariate regressions to learn the nonlinearity of the external relationships. This PLS
regression can be considered, to some extent, as a multi-KPLS regression with a Gaussian
kernel.

The locality-preserving partial least squares (LPPLS) model (given in Chap. 10)
is another external nonlinear PLS model, and its structure is relatively simple compared
with the KPLS model (Wang et al. 2017). However, the LPPLS model has at least
two limitations. The first is that the local geometric structure cannot be preserved well
with uniform weights, while the $\sigma$ parameter of the Gaussian weights (Kokiopoulou and
Saad 2007) is difficult to select properly. The second is that it gives an oblique decomposition
of the measured process variables. The LPPLS model extracts the principal
components and retains the local structure by locality-preserving projection (LPP). LLE,
another nonlinear dimensionality reduction technique, transforms the global nonlinear
problem into a combination of several local linear problems by introducing local
geometric information. Compared with the LLE method, the locality-preserving strategy of
LPP is more complex, and its parameters (Gaussian weights) are more numerous and not easy
to tune.

The global plus local projection to latent structure (GPLPLS) (given in Chap. 9)
integrates the advantages of the PLS and LLE methods. The distinctive feature of the
GPLPLS model is that the local nonlinear features are enhanced by LLE in the PLS
decomposition (Zhou et al. 2018). GPLPLS uses a strategy of addition rather than embedding,
in which the new feature space is divided into a linear part (global projection) and a
nonlinear part (local preserving). It confirms that the LLE plus PLS algorithm is able
to perform the decomposition of the input and output spaces and effectively preserve
the local geometric structure. However, this combination needs further research, such
as how to combine the two methods more effectively, how to complete the orthogonal
decomposition, and how to evaluate the monitoring effect quantitatively.

Based on the above analysis, Local Linear Embedded Projection of Latent Structure
(LLEPLS) is proposed. It extracts the maximum correlation information between
input and output, and at the same time reveals and preserves the intrinsic nonlinear structure
of the original data. The principal components of the input space (or measured
variable space) extracted by LLEPLS still contain variations orthogonal to $Y$.
These variations are output irrelevant and do not contribute to the output prediction.
Moreover, LLEPLS is an oblique projection on the input space. Orthogonalization
is a possible solution to these issues. The local linear embedded orthogonal
projection to latent structure (LLEOPLS) model is therefore proposed in order to explain
the LLEPLS prediction model further and to detect quality-related faults. LLEOPLS
removes the variations orthogonal to the output from the $T^2$ statistic. LLEOPLS
differs significantly from other existing nonlinear PLS models in that it provides orthogonal
projections with local geometric structure preservation and has fewer, more easily fixed parameters.


11.2 A Brief Review of the LLE Method 


Consider the normalized data set $X = [x^T(1), x^T(2), \ldots, x^T(n)]^T \in R^{n \times m}$
($x = [x_1, x_2, \ldots, x_m] \in R^{1 \times m}$) of the model, where $n$ is the number of samples and $m$ is
the number of input variables. The LLE algorithm introduces the local structural information
and transforms the global nonlinear problem into a combination of multiple
local linear problems. It performs well on locally nonlinear processes.

The neighborhood size $k_x$ is crucial for the local geometric structure. The $k$ nearest
neighbors (KNN) of a sample are selected according to a distance measure such as the
Euclidean distance, and the neighborhood size can be selected as (Kouropteva et al. 2002)

$$k_{x,\mathrm{opt}} = \arg\min_{k_x} \left(1 - \rho^2_{D_x D_{\Phi_x}}\right), \tag{11.1}$$

where $D_x$ and $D_{\Phi_x}$ denote the distance matrices (between point pairs) in $X$ and
$\Phi_x$ ($\Phi_x$ is given in (11.4)), and $\rho$ denotes the standard linear correlation coefficient
between $D_x$ and $D_{\Phi_x}$.
Next, the $k_x$ nearest neighbors of the sample $x(i)$ are obtained. Then $x(i)$ is
linearly expressed by its $k_x$ nearest neighbors $x(j)$ through the following
optimization objective,

$$
\begin{aligned}
J(A_x) &= \min \sum_{i=1}^{n} \Big\| x(i) - \sum_{j=1}^{k_x} a_{ij,x}\, x(j) \Big\|^2 \\
\text{s.t.}\ & \sum_{j=1}^{k_x} a_{ij,x} = 1,
\end{aligned} \tag{11.2}
$$

where $[a_{ij,x}] := A_x \in R^{n \times k_x}$ ($i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, k_x$) denotes the weight
coefficients. Usually, points belonging to the space $X$ are projected onto a new
low-dimensional reduced space $\Phi_x = [\phi_x^T(1), \phi_x^T(2), \ldots, \phi_x^T(n)]^T \in R^{n \times d}$
($d < m$, $\phi_x \in R^{1 \times d}$) determined by the following optimization:

$$
\begin{aligned}
J(\Phi_x) &= \min \sum_{i=1}^{n} \Big\| \phi_x(i) - \sum_{j=1}^{k_x} a_{ij,x}\, \phi_x(j) \Big\|^2 \\
\text{s.t.}\ & \Phi_x^T \Phi_x = I.
\end{aligned} \tag{11.3}
$$

For further analysis, a linear mapping matrix $W = [w_1, \ldots, w_d] \in R^{m \times d}$ is
introduced with the guarantee of local embedding,

$$\phi_x(i) = x(i) W, \quad (i = 1, 2, \ldots, n), \tag{11.4}$$

where $w_j$, $j = 1, \ldots, d$, denotes the projection vector. Then the optimization (11.3)
is rewritten as

$$
\begin{aligned}
J_{LLE}(W) &= \min \operatorname{tr}\left(W^T X^T M_x^T M_x X W\right) \\
\text{s.t.}\ & W^T X^T X W = I,
\end{aligned} \tag{11.5}
$$


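where $M_x = (I - A_x) \in R^{n \times n}$. An SVD is performed on $M_x$ in order to
simplify the dimensionality reduction problem,

$$M_x = \begin{bmatrix} U_x & \tilde{U}_x \end{bmatrix} \begin{bmatrix} S_x & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_x^T \\ \tilde{V}_x^T \end{bmatrix}.$$

Then, the minimum value problem (11.5) is changed as follows:

$$
\begin{aligned}
J_{LLE}(W) &= \max \operatorname{tr}\left(W^T X_M^T X_M W\right) \\
\text{s.t.}\ & W^T X^T X W = I,
\end{aligned} \tag{11.6}
$$

where $X_M := S_x^{-1} V_x^T X = S_{xM} X$. Generally, LLE must choose the reduced
dimension $d$ in (11.3) in advance, whereas PCA can determine the corresponding dimension
from specific criteria such as the cumulative contribution. The optimization
problem (11.6) is further rewritten as

$$
\begin{aligned}
J_{LLE}(w) &= \max w^T X_M^T X_M w \\
\text{s.t.}\ & w^T X^T X w = 1,
\end{aligned} \tag{11.7}
$$

where $w \in R^{m \times 1}$. The criteria for determining the number of principal components
in PCA can then be directly applied to LLE. Based on the SVD algorithm, the matrix $X_M$
is decomposed into a “load matrix” $P_d = [p_1, p_2, \ldots, p_d]$ and a “score matrix”
$T_d = [t_1, t_2, \ldots, t_d]$,

$$X_M^T X_M = \begin{bmatrix} P_{d0} & P_{r0} \end{bmatrix} \begin{bmatrix} \Lambda_d & 0 \\ 0 & \Lambda_r \end{bmatrix} \begin{bmatrix} P_{d0}^T \\ P_{r0}^T \end{bmatrix},$$

and, defining $P_d = P_{d0}/\|X P_{d0}\|$ and $P_r = P_{r0}/\|X P_{r0}\|$,

$$
\begin{aligned}
X_M &= T_d P_d^T + T_r P_r^T \\
&= X_M P_d P_d^T + X_M (I - P_d P_d^T),
\end{aligned} \tag{11.8}
$$

where $T_d = X_M P_d$ and $T_r = X_M P_r$.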

It is observed from (11.7) and (11.8) that the projection direction of LLE can be
obtained by maximizing the variance. Thus, LLE can be used to construct a new PLS regression
with the ability to preserve the local geometric structure, according to the component
extraction criteria.
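
The weight computation (11.2) and the linearized problem (11.5) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the weights are stored in an $n \times n$ matrix so that $M_x = I - A_x$ is well defined, a small regularization term (not mentioned in the text) is added because the local Gram matrix is singular when $k_x > m$, and $X^T X$ is assumed to be positive definite.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Reconstruction weights of (11.2), stored as an n x n matrix A."""
    n = X.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        dist2 = np.sum((X - X[i]) ** 2, axis=1)
        dist2[i] = np.inf                       # exclude the sample itself
        nbrs = np.argsort(dist2)[:k]            # k nearest neighbors
        Z = X[nbrs] - X[i]
        G = Z @ Z.T                             # local Gram matrix
        G = G + reg * np.trace(G) / k * np.eye(k)   # assumed regularization
        w = np.linalg.solve(G, np.ones(k))
        A[i, nbrs] = w / w.sum()                # weights sum to one
    return A

def linear_lle(X, A, d):
    """Solve (11.5): min tr(W^T X^T M^T M X W) s.t. W^T X^T X W = I."""
    n = X.shape[0]
    M = np.eye(n) - A
    S = X.T @ M.T @ M @ X
    L = np.linalg.cholesky(X.T @ X)             # X^T X = L L^T
    Linv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(Linv @ S @ Linv.T)   # ascending eigenvalues
    W = Linv.T @ vecs[:, :d]                    # d smallest eigenvalues
    return W, vals[:d]
```

The Cholesky factor turns the constrained problem into an ordinary symmetric eigenproblem, so the returned `W` satisfies the constraint of (11.5) by construction.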

Variance (factor variation) is used to extract the latent variables in the PLS algorithm,
which transforms the original data $X$ and $Y$ into a set of t-scores $T$ and u-scores $U$.
The latent factors $T$ and $U$ are chosen by maximizing the factor variation. The aim is
to use fewer dimensions while retaining more features of the original data. PLS
is a linear dimensionality reduction technique and does not explore the intrinsic
structure of the original data. This is not conducive to data classification and may cause
the data to be mixed together. The phenomena that may occur with PLS are illustrated in Fig. 11.2,
using PCA, which behaves similarly. Figure 11.2a shows a two-mode data space $X$ and Fig. 11.2b
gives its first principal component $t_1$ in PCA. The contribution of the first principal
component $t_1$ is 99%. As shown in Fig. 11.2b, the blue ‘o’ and black ‘*’ points are mixed together
in the one-dimensional coordinate system. The second principal
component is discarded because of its small contribution, although it maintains the local
geometric structure.
Fig. 11.2 PCA decomposition and its projection of a two-mode data space
[Figure panel: (b) First principal component $t_1$ in PCA]


11.3 LLEPLS Models and LLEPLS-Based Fault Detection 


11.3.1 LLEPLS Models 


In order to extract the first component pair $(t_1, u_1)$, the traditional PLS optimization
is expressed as

$$
\begin{aligned}
J_{PLS}(w_1, c_1) &= \max w_1^T X^T Y c_1 \\
\text{s.t.}\ & w_1^T w_1 = 1,\ c_1^T c_1 = 1.
\end{aligned} \tag{11.9}
$$

Define $E_0 = X$ and $F_0 = Y$. The PLS latent variables $t_1$ and $u_1$ are obtained from
$t_1 = E_0 w_1$ and $u_1 = F_0 c_1$. Here $w_1$ and $c_1$ are the eigenvectors corresponding to
the maximum eigenvalues of the matrices

$$E_0^T F_0 F_0^T E_0 w_1 = \theta_1^2 w_1 \tag{11.10}$$
$$F_0^T E_0 E_0^T F_0 c_1 = \theta_1^2 c_1. \tag{11.11}$$
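
A short NumPy sketch of (11.10) and (11.11) is given below. The function name is hypothetical. The sketch relies on the standard fact that the dominant eigenvalue of $E_0^T F_0 F_0^T E_0$ is the square of the largest singular value of $E_0^T F_0$, so the returned objective value can be checked against an SVD.

```python
import numpy as np

def pls_first_pair(E0, F0):
    """First PLS weight pair from the eigenproblem (11.10)."""
    vals, vecs = np.linalg.eigh(E0.T @ F0 @ F0.T @ E0)   # symmetric matrix
    w1 = vecs[:, -1]                 # eigenvector of the largest eigenvalue
    c1 = F0.T @ E0 @ w1              # proportional to the solution of (11.11)
    c1 = c1 / np.linalg.norm(c1)     # c1^T c1 = 1
    theta1 = w1 @ E0.T @ F0 @ c1     # value of the objective in (11.9)
    t1, u1 = E0 @ w1, F0 @ c1        # latent variables
    return w1, c1, theta1, t1, u1
```

The sign of `w1` returned by the eigensolver is arbitrary, but `c1` is computed from `w1`, so `theta1` is always nonnegative.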


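Locally linear embedded partial least squares (LLEPLS) is proposed to optimize
the following objective function:

$$
\begin{aligned}
J_{LLEPLS}(w_1, c_1) &= \max w_1^T X_M^T Y_M c_1 \\
\text{s.t.}\ & w_1^T X^T X w_1 = 1,\ c_1^T Y^T Y c_1 = 1,
\end{aligned} \tag{11.12}
$$

in which $Y_M$ is constructed from $Y$ in the same way as $X_M$ is constructed from $X$. Here
$A_y$ is built from the neighbors with its own neighborhood parameter $k_y$, similarly to
$A_x$, and $S_y$, $V_y$ and $U_y$ are defined similarly to $S_x$, $V_x$ and $U_x$.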

The criteria for LLEPLS component decomposition and latent factor extraction
are as follows:


(1) The latent factors $u_1$ and $t_1$ are chosen to maximize the nonlinear variation of
the factors (by local linear embedding).

(2) The correlation between the latent factors $u_1$ and $t_1$ should be as strong as
possible.

The latent variable calculation process of the LLEPLS model is as follows.
Denote $E_{0L} = X_M$ and $F_{0L} = Y_M$, analogously to the traditional linear PLS. The
constrained optimization problem (11.12) is transformed by introducing Lagrange
multipliers,

$$\Psi(w_1, c_1) = w_1^T E_{0L}^T F_{0L} c_1 - \lambda_1 (w_1^T X^T X w_1 - 1) - \lambda_2 (c_1^T Y^T Y c_1 - 1). \tag{11.13}$$

The optimal $w_1$ and $c_1$ are obtained from $\partial\Psi/\partial w_1 = 0$ and $\partial\Psi/\partial c_1 = 0$.
The optimization problem (11.13) is then solved as a maximum eigenvalue problem,

$$(X^T X)^{-1} E_{0L}^T F_{0L} (Y^T Y)^{-1} F_{0L}^T E_{0L} w_1 = \theta_1^2 w_1 \tag{11.14}$$
$$(Y^T Y)^{-1} F_{0L}^T E_{0L} (X^T X)^{-1} E_{0L}^T F_{0L} c_1 = \theta_1^2 c_1. \tag{11.15}$$

The first optimal weight vector $w_1$ in the conventional linear PLS (11.10)
corresponds to the matrix $E_0^T F_0 F_0^T E_0$. For LLEPLS (11.14), the optimal $w_1$ is
derived from the corresponding matrix $(X^T X)^{-1} E_{0L}^T F_{0L} (Y^T Y)^{-1} F_{0L}^T E_{0L}$. These
matrices are very similar in form. The extraction and modeling of the remaining
components can be done by traditional PLS methods.

Fig. 11.3 Outer- and inner-model presentation for LLEPLS decomposition

It is worth pointing out that the columns of the input space $X$ and/or the output
space $Y$ may not be full rank, in which case the inverse of $X^{T}X$ and/or $Y^{T}Y$ does not exist. As with
$S_x$ in (11.6), the corresponding matrix inverse can still be obtained for $X$ and/or $Y$.
This does not affect the following analysis, so both cases are treated alike
in the rest of this chapter.

The first $d$ components are retained to build the regression model, where $d$ is
determined by cross-validation. Similar to the outer- and inner-model presentation
of the PLS decomposition, the corresponding LLEPLS decomposition is shown
in Fig. 11.3. The new feature spaces $X_M$ and $Y_M$ are both constructed
from the nonlinear part, i.e., the local structure information. Compared with
the decomposition of GPLPLS shown in Fig. 9.2, the global linear part is eliminated.


11.3.2 LLEPLS for Process and Quality Monitoring 


The locally linear embedding in the low-dimensional space of $X$ and $Y$ is formed
by a few latent variables $(t_1, \ldots, t_d)$ in the LLEPLS model. The neighborhood mappings
$E_{0L}$ and $F_{0L}$ are decomposed as follows:

$$
\begin{aligned}
E_{0L} &= \sum_{i=1}^{d} t_i p_i^{T} + E_{0r} = T P^{T} + E_{0r}, \\
F_{0L} &= \sum_{i=1}^{d} t_i q_i^{T} + F_{0r} = T Q^{T} + F_{0r},
\end{aligned} \tag{11.16}
$$

where $T = [t_1, t_2, \ldots, t_d]$ denotes the score vectors, and $P = [p_1, p_2, \ldots, p_d]$ and $Q = [q_1, q_2, \ldots, q_d]$ denote the loading matrices of $E_{0L}$ and $F_{0L}$, respectively. The score $T$ is
represented in terms of the neighborhood mapping data $E_{0L}$,
$$
T = E_{0L} R = S_x^{1/2} E_0 R, \tag{11.17}
$$

where $R = [r_1, \ldots, r_d]$, and

$$
r_i = \prod_{j=1}^{i-1} \left( I_m - w_j p_j^{T} \right) w_i .
$$
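The product formula for $r_i$ can be verified on an ordinary linear PLS model. The sketch below is an illustration under stated assumptions: it uses a plain NIPALS-style deflation on synthetic data rather than the locally embedded matrices, and the helper names `pls_nipals` and `direct_weights` are hypothetical. It checks that the scores obtained by deflation equal $E_0 R$, so that $R$ maps the undeflated data directly to the scores.

```python
import numpy as np

def pls_nipals(X, Y, d):
    """Extract d PLS components by deflation; return W, P, Q, T (columns per component)."""
    E, F = X.copy(), Y.copy()
    W, P, Q, T = [], [], [], []
    for _ in range(d):
        U, _, _ = np.linalg.svd(E.T @ F, full_matrices=False)
        w = U[:, 0]                       # weight vector of the current component
        t = E @ w                         # score vector
        p = E.T @ t / (t @ t)             # input loading
        q = F.T @ t / (t @ t)             # output loading
        E = E - np.outer(t, p)            # deflate input
        F = F - np.outer(t, q)            # deflate output
        W.append(w); P.append(p); Q.append(q); T.append(t)
    return tuple(np.column_stack(a) for a in (W, P, Q, T))

def direct_weights(W, P):
    """r_i = prod_{j<i} (I - w_j p_j^T) w_i, accumulated from left to right."""
    m, d = W.shape
    R = np.zeros((m, d))
    prod = np.eye(m)
    for i in range(d):
        R[:, i] = prod @ W[:, i]
        prod = prod @ (np.eye(m) - np.outer(W[:, i], P[:, i]))
    return R

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
Y = X[:, :3] @ rng.standard_normal((3, 3)) + 0.1 * rng.standard_normal((40, 3))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

W, P, Q, T = pls_nipals(X, Y, d=3)
R = direct_weights(W, P)
```

The same matrix can be written in closed form as $R = W (P^{T} W)^{-1}$, which is usually the more convenient expression in code.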


Equations (11.16) and (11.17) are difficult to apply directly in practice because the
locality-preserving matrix $S_x$ must be calculated, so the decomposition is given for the scaled and
mean-centered $E_0$ and $F_0$ instead,

$$
E_0 = T_0 P^{T} + \bar{E}_0, \tag{11.18}
$$
$$
F_0 = T_0 \bar{Q}^{T} + \bar{F}_0 = E_0 R \bar{Q}^{T} + \bar{F}_0, \tag{11.19}
$$


where To = EoR, Q = T{ Fo. 

Now consider the monitoring of a new sample $x$ and subsequently $y$. First,
the sample is scaled and mean-centered; then an oblique projection is applied on the
input data space:

$$
\begin{aligned}
x &= \hat{x} + x_e, \\
\hat{x} &= P R^{T} x, \\
x_e &= \left( I - P R^{T} \right) x.
\end{aligned} \tag{11.20}
$$


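The statistics $T^2$ and $Q$ are calculated as follows:

$$
\begin{aligned}
t &= R^{T} x, \\
T^2 &= t^{T} \Lambda^{-1} t = t^{T} \left\{ \frac{1}{n-1} T_0^{T} T_0 \right\}^{-1} t, \\
Q &= \| x_e \|^2 = x^{T} \left( I - P R^{T} \right) x,
\end{aligned} \tag{11.21}
$$

where $\Lambda$ is the sample covariance matrix.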
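The statistics in (11.21) can be computed in a few lines. The sketch below is a minimal illustration; the function name `t2_and_q` and the toy one-component model built in the demo are assumptions. $Q$ is evaluated as the squared norm of $x_e$, which is the first form given in (11.21).

```python
import numpy as np

def t2_and_q(x, R, P, T0):
    """Return (T2, Q) for one scaled, mean-centered sample x."""
    n = T0.shape[0]
    Lam = T0.T @ T0 / (n - 1)                 # sample covariance of the training scores
    t = R.T @ x                               # score of the new sample
    T2 = float(t @ np.linalg.solve(Lam, t))   # t^T Lam^{-1} t
    x_e = x - P @ t                           # (I - P R^T) x
    Q = float(x_e @ x_e)                      # squared norm of the residual
    return T2, Q

# Toy one-component model (hypothetical, for illustration only).
rng = np.random.default_rng(3)
X = rng.standard_normal((30, 3))
X -= X.mean(axis=0)
w = np.linalg.svd(X, full_matrices=False)[2][0]   # a unit weight vector
t_train = X @ w
p = X.T @ t_train / (t_train @ t_train)
R, P, T0 = w[:, None], p[:, None], t_train[:, None]

stats = np.array([t2_and_q(x, R, P, T0) for x in X])
```

A useful sanity check on training data is that the $T^2$ values sum to $d\,(n-1)$, where $d$ is the number of retained components.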
The space of measured variables, i.e., the input space, is divided into two subspaces:
the score subspace and the residual subspace. LLEPLS detects quality-related faults with
the $T^2$ statistic in the score subspace and quality-irrelevant faults with the $Q$
statistic in the residual subspace. The PLS scores that constitute the $T^2$ statistic
still include variation orthogonal to $Y$. Therefore, LLEPLS still has deficiencies
in quality-related fault detection.


11.4 LLEOPLS Models and LLEOPLS-Based Fault 
Detection 


As demonstrated in (Li et al. 2010) and (Ding et al. 2013), the standard PLS performs
an oblique decomposition of the measured process variables. The LLEPLS model
(11.16) also applies an oblique decomposition (11.20) to the measured process
variables, similar to the standard PLS model. Thus, the principal part of the
measured process variables may include variations orthogonal to the output variables.
In other words, the principal components still include output-irrelevant variation,
and the residual part may include a large amount of output-related variation. In addition,
the number of principal components often depends on the operator's decision
and is likely to cause component redundancy. In order to solve
these problems, it is necessary to further decompose the LLEPLS model in (11.18)
and obtain an orthogonal decomposition of the measured process variables. In
this model, the regression coefficient $R\bar{Q}^{T}$ in (11.19) is used to describe
the relationship between $E_0$ and $F_0$. Performing SVD on $R\bar{Q}^{T}$ gives the
orthogonal decomposition,


$$
R \bar{Q}^{T} = U_{pc} S_{pc} V_{pc}^{T}, \tag{11.22}
$$

where $S_{pc}$ contains all non-zero singular values in descending order, and $V_{pc}$ and $U_{pc}$
are the corresponding right and left singular vectors. Then,

$$
F_0 = E_0 U_{pc} S_{pc} V_{pc}^{T} + \bar{F}_0 = T_{pc} Q_{pc}^{T} + \bar{F}_0, \tag{11.23}
$$

where $T_{pc} = E_0 U_{pc}$ and $Q_{pc} = V_{pc} S_{pc}$. The output-residual subspace $\bar{F}_0$ represents
the unpredictable part of the output but may still include some variation.

Furthermore, $E_0$ is decomposed into two orthogonal subspaces by $T_{pc}$:

$$
E_0 = \hat{E}_0 + X_e = T_{pc} U_{pc}^{T} + E_0 \left( I - U_{pc} U_{pc}^{T} \right), \tag{11.24}
$$

where $\hat{E}_0 = T_{pc} U_{pc}^{T}$ and $X_e = E_0 \left( I - U_{pc} U_{pc}^{T} \right)$. $X_e$ denotes the orthogonal
input-residual subspace. The new data sample $x$ and subsequently $y$ are
orthogonally projected on the input data space for process and quality monitoring,

$$
\begin{aligned}
x &= \hat{x} + x_e, \\
\hat{x} &= U_{pc} U_{pc}^{T} x, \\
x_e &= \left( I - U_{pc} U_{pc}^{T} \right) x, \\
t_{pc} &= U_{pc}^{T} x, \\
y_e &= y - Q_{pc} t_{pc}.
\end{aligned} \tag{11.25}
$$
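The orthogonal decomposition in (11.22) to (11.25) can be sketched as follows. This is an illustration under assumptions: the coefficient matrix passed in stands for $R\bar{Q}^{T}$ and is simply random in the demo, and the names `orthogonal_split`, `project`, and the tolerance `tol` are hypothetical.

```python
import numpy as np

def orthogonal_split(coef, tol=1e-10):
    """SVD of the regression coefficient matrix (m x l); keep non-zero singular values."""
    U, s, Vt = np.linalg.svd(coef, full_matrices=False)
    keep = s > tol * s.max()
    U_pc = U[:, keep]                  # left singular vectors
    Q_pc = Vt[keep].T * s[keep]        # V_pc S_pc
    return U_pc, Q_pc

def project(x, y, U_pc, Q_pc):
    """Orthogonal projection of a new sample as in (11.25)."""
    t_pc = U_pc.T @ x
    x_hat = U_pc @ t_pc                # output-related part of x
    x_e = x - x_hat                    # input residual, orthogonal to x_hat
    y_e = y - Q_pc @ t_pc              # output residual
    return t_pc, x_hat, x_e, y_e

rng = np.random.default_rng(4)
coef = rng.standard_normal((5, 2))     # stand-in for R Qbar^T (hypothetical values)
U_pc, Q_pc = orthogonal_split(coef)

x = rng.standard_normal(5)
y = coef.T @ x                         # an output that the model predicts exactly
t_pc, x_hat, x_e, y_e = project(x, y, U_pc, Q_pc)
```

Because $U_{pc}$ has orthonormal columns, $\hat{x}$ and $x_e$ are orthogonal, which is the property that distinguishes this decomposition from the oblique projection in (11.20).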


The LLEOPLS model is given in (11.23) and (11.24) with several parameters to be
determined in advance. The selection of the optimal parameters has been described for
LLE (Kouropteva et al. 2002). The optimal parameters $[k_x, k_y]$ of the LLEOPLS model
are determined by simultaneously considering the characteristics of LLE itself and
the relationship between the input and output spaces. The following optimization is
given for determining the parameters $[k_x, k_y]$:


[kes ky loge = are min (1 = oh, oa +1- Pb, 
” (11.26) 


3 
pre ) 


where ĵ = Q pet pe- ‘lirain aNd -| pre are the training data set and the testing data 
sets, respectively. The first two terms in (11.26), 1 — Poi: and 1 — Pb D, > aim 


+1— PD;D, +i- Pb:D, 


train 


at evaluating the geometric similarity between the embedding space and the high- 
dimensional space. The last two terms, 1 — Po, p and 1 — Pan p » indicate the effect 
of the model which indirectly reflects the role of the first two terms. Cross-validation 
is used to ensure the training results of the model. The last term is the most important 


part in (11.26), 
) : (11.27) 
pre 


A generalized LLEOPLS model with the optimal parameters $k_x$ and $k_y$ can be
used to monitor the operation of the system. The $T^2$ statistics can monitor the output-related
score ($T_{pc}$), the output-residual part, and the input-residual part,


— à 2 
[kx, ky] oot = arg an (1 = PD;D, 


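$$
\begin{aligned}
T_{pc}^2 &= t_{pc}^{T} \Lambda_{pc}^{-1} t_{pc} = t_{pc}^{T} \left\{ \frac{1}{n-1} T_{pc}^{T} T_{pc} \right\}^{-1} t_{pc}, \\
T_{e}^2 &= x_e^{T} \Lambda_{x,e}^{-1} x_e = x_e^{T} \left\{ \frac{1}{n-1} X_e^{T} X_e \right\}^{-1} x_e, \\
T_{ye}^2 &= y_e^{T} \Lambda_{y,e}^{-1} y_e = y_e^{T} \left\{ \frac{1}{n-1} Y_e^{T} Y_e \right\}^{-1} y_e,
\end{aligned} \tag{11.28}
$$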



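where $\Lambda_{pc}$, $\Lambda_{x,e}$, and $\Lambda_{y,e}$ denote the sample covariance matrices, and $Y_e := \bar{F}_0 = F_0 - T_{pc} Q_{pc}^{T}$.

The $T_{pc}$ of the LLEOPLS method is not obtained from a scaled and mean-centered
matrix $E_{0L}$. The control limits of the $T^2$ statistics are usually calculated
from the probability density function (PDF) estimated by the non-parametric kernel density estimation (KDE) method.
The $T_{pc}^2$ and $T_e^2$ statistics are both univariate, although the processes represented by
these statistics are multivariate. The control limits for the monitoring statistics
($T_{pc}^2$, $T_e^2$, and $T_{ye}^2$) are then obtained from the corresponding PDF estimates,

$$
\begin{aligned}
\int_{0}^{Th_{pc,\alpha}} g\left(T_{pc}^2\right)\, dT_{pc}^2 &= \alpha, \\
\int_{0}^{Th_{e,\alpha}} g\left(T_{e}^2\right)\, dT_{e}^2 &= \alpha, \\
\int_{0}^{Th_{ye,\alpha}} g\left(T_{ye}^2\right)\, dT_{ye}^2 &= \alpha,
\end{aligned}
$$

where

$$
g(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right),
$$

in which $K(\cdot)$ and $h$ are the kernel function and its bandwidth (smoothing parameter),
respectively, and $x_i$, $i = 1, \ldots, n$, are the sample values of the statistic.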
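A control limit of this kind can be obtained numerically from training values of a statistic. The sketch below is one possible realization under explicit assumptions: a Gaussian kernel, a rule-of-thumb bandwidth when `h` is not supplied, integration of the estimated density from $-\infty$ rather than from 0, and a bisection search for the threshold. The function name `kde_control_limit` and the chi-square demo data are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def kde_control_limit(stats, alpha=0.9975, h=None):
    """Threshold Th such that the Gaussian-KDE cumulative probability at Th equals alpha."""
    s = np.asarray(stats, dtype=float)
    n = s.size
    if h is None:
        h = 1.06 * s.std(ddof=1) * n ** (-0.2)   # rule-of-thumb bandwidth (assumption)

    def cdf(x):
        # Integral of the Gaussian KDE from -inf to x: mean of normal CDFs.
        return float(np.mean(0.5 * (1.0 + _erf((x - s) / (h * math.sqrt(2.0))))))

    lo, hi = s.min() - 10.0 * h, s.max() + 10.0 * h
    for _ in range(100):                          # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(5)
train_stat = rng.chisquare(df=3, size=2000)       # stand-in for training T^2 values
th_9975 = kde_control_limit(train_stat, alpha=0.9975)
th_99 = kde_control_limit(train_stat, alpha=0.99)
```

Because the kernel places some probability mass below zero, this threshold can differ slightly from one computed by integrating from 0 as in the text; for statistics well away from zero the difference is small.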
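Finally, the fault detection logic for the output-residual subspace is given by

$$
\begin{aligned}
T_{ye}^2 > Th_{ye,\alpha} &\quad \text{Unpredictable output faults} \\
T_{ye}^2 \le Th_{ye,\alpha} &\quad \text{Fault-free in unpredictable output.}
\end{aligned} \tag{11.29}
$$

$T_{ye}^2$ includes the output information, so it is suitable for monitoring the output-residual
subspace. However, this a posteriori quality monitoring is not the focus here. Instead,
process-based quality monitoring is of greater interest. The fault detection logic for the
input space is (Zhou et al. 2018):

$$
\begin{aligned}
T_{pc}^2 > Th_{pc,\alpha} &\quad \text{Quality-relevant faults} \\
T_{pc}^2 > Th_{pc,\alpha} \ \text{or}\ T_{e}^2 > Th_{e,\alpha} &\quad \text{Process-relevant faults} \\
T_{pc}^2 \le Th_{pc,\alpha} \ \text{and}\ T_{e}^2 \le Th_{e,\alpha} &\quad \text{Fault-free.}
\end{aligned} \tag{11.30}
$$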
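The input-space logic in (11.30) amounts to two threshold comparisons per sample. A minimal sketch follows; the function name `classify_sample` and the label strings are illustrative choices, not part of the original method description.

```python
def classify_sample(T2_pc, T2_e, th_pc, th_e):
    """Apply the input-space detection logic of (11.30) to one sample."""
    labels = []
    if T2_pc > th_pc:
        labels.append("quality-relevant")
    if T2_pc > th_pc or T2_e > th_e:
        labels.append("process-relevant")
    return labels or ["fault-free"]

# Example with made-up statistic values and thresholds.
example_quality = classify_sample(12.0, 1.0, th_pc=9.0, th_e=20.0)
example_process = classify_sample(2.0, 35.0, th_pc=9.0, th_e=20.0)
example_normal = classify_sample(2.0, 1.0, th_pc=9.0, th_e=20.0)
```

Note that under (11.30) a quality-relevant alarm always implies a process-relevant alarm, so the first example carries both labels.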


The monitoring procedure of the LLEOPLS algorithm for a complex industrial system
is given as follows:

1. The original data $X$ and $Y$ are scaled to zero mean and unit variance.

2. The LLE and PLS optimization objectives ((11.4) and (11.9)) are combined. Then
perform the LLEPLS operation on $X$ and $Y$ to yield $T_0$, $\bar{Q}$, and $R$ as well as the
output-residual subspace $Y_e$, based on (11.18) and (11.19).

3. The number of LLEPLS factors $d$ is determined by cross-validation.

4. Perform SVD on $R\bar{Q}^{T}$ to obtain $U_{pc}$, $T_{pc}$, and $Q_{pc}$.

5. Build the input-residual subspace $X_e$.

6. Calculate the monitoring statistics (11.28) and their control limits, and perform fault monitoring according to
the fault detection logic (11.30).


11.5 Case Study 


The fault detection strategy based on the proposed LLEPLS and LLEOPLS models
is tested on the Tennessee Eastman Process (TEP) simulation platform (Lyman
and Georgakis 1995). To better demonstrate the effectiveness and rationality of the
proposed monitoring strategy, it is compared with the PLS monitoring strategy and the concurrent
projection to latent structure (CPLS) model (Qin and Zheng 2012). With
the CPLS algorithm, the input and output spaces are projected into five subspaces:
the input-principal subspace, the input-residual subspace, the output-principal
subspace, the output-residual subspace, and the joint input-output subspace. When only
the monitoring capability for quality-related faults is considered, a single input-residual
subspace replaces the input-residual and input-principal subspaces in the CPLS model,
and the $T_e^2$ statistic replaces the corresponding monitoring statistics. In order to emphasize
process-based quality monitoring, the output-residual subspace in the LLEOPLS model
is not considered. Similarly, the output-principal and output-residual subspaces in the
CPLS model are not considered.


11.5.1 Models and Discussion 


All process measurement variables (XMEAS(1:22)) and manipulated variables
(XMV(1:11)) form the input variable matrix $X$. The quality variable matrix $Y$
consists of XMEAS(35) and XMEAS(38). The training data set is the normal data IDV(0), and the
testing data consist of the 21 fault data sets IDV(1-21). The optimal parameters of LLEPLS
and LLEOPLS are $k_x = 24$ and $k_y = 20$. The numbers of principal components
of the PLS, CPLS, LLEPLS, and LLEOPLS models are 6, 6, 5, and 5, respectively.

From the analysis in Chaps. 9 and 10, it is known that faults IDV(3,4),
IDV(9,11), IDV(14,15), and IDV(19) have almost no effect on product quality, whereas
the other faults produce significant variations in the quality variables when
component G (XMEAS(35)) and component E (XMEAS(38)) are selected as the product quality variables.
The FDR and FAR of PLS, LLEPLS, CPLS, and LLEOPLS at the control limit
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Table 11.1 FDR of PLS, LLEPLS, CPLS, and LLEOPLS 


PLS CPLS LLEPLS LLEOPLS 
IDV T? T PQAR 
1 99.75 |28.25 
2 98.25 | 77.00 
3 1.75 | 0.88 
4 100.00 | 0.25 
5 100.00 | 21.88 
6 100.00 | 95.25 
7 100.00 | 33.88 
8 97.88 | 75.38 
9 1.88 | 0.88 
10 88.25 | 17.13 
11 77.75 1.38 
12 99.88 | 81.00 
13 95.25 | 85.13 
14 100.00 | 0.00 
15 3.88 | 0.25 
16 91.25 | 8.63 
17 96.75 | 5.63 
18 90.25 | 86.38 
19 0.00 4.13 | 0.00 91.13 | 0.13 0.13 0.38 91.38 | 0.25 
20 26.63 34.00 | 27.75 90.38 | 22.75 |19.88 | 11.25 90.88 | 4.38 
21 29.88 39.63 | 24.50 43.88 |33.50 | 19.63 | 16.25 53.75 | 16.75 


with confidence level 99.75% are shown in Tables 11.1 and 11.2, respectively. Based
on the two tables, the monitoring results of LLEOPLS differ somewhat from
those of the other methods: for some faults, such as IDV(14)
and IDV(17), its quality-related detection rates are almost as low as the FAR. These faults are considered quality-related by the PLS method,
whereas the LLEOPLS method indicates that they are quality-irrelevant.

Which monitoring results are more credible? To assess
whether the final fault detection result is reasonable, the posterior
quality alarm rate (PQAR) is introduced,

$$
\mathrm{PQAR} = \frac{\text{No. of samples}\ \left( \| Y_F \| > 3 \mid f \neq 0 \right)}{\text{total samples}\ (f \neq 0)} \times 100, \tag{11.31}
$$

where $Y_F$ is the scaled and mean-centered output data of the
fault cases. The PQAR is also given in Table 11.1. The 21 faults are divided into two
categories by PQAR. Type I is quality-independent ($\mathrm{PQAR}_i < 6$, $i = 1, 2, \ldots, 21$),
including IDV(3,4,9,11,14,15,17,19,20). Type II is quality-relevant faults, and further
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Table 11.2 FAR of PLS, LLEPLS, CPLS, and LLEOPLS 
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PLS CPLS LLEPLS LLEOPLS 
IDV T° Q T? T T? Q The T 

1 0.63 
2 0.00 
3 3.13 
4 0.63 
5 0.63 
6 0.00 
7 0.00 
8 0.00 
9 0.63 
10 0.00 
11 0.63 
12 0.63 
13 0.00 
14 0.00 
15 0.63 
16 1.25 
17 0.63 
18 1.25 
19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.63 
20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
21 1.25 0.63 0.00 2.50 0.63 0.63 0.63 1.88 


classified into three categories: IDV(16) has a slight effect on quality; IDV(1, 2, 5,
6, 7, 8, 10, 12, 13, 18) have a serious effect on quality; and IDV(21) causes a slow
drift of the output variable. The LLEOPLS method reaches a consistent
conclusion ($T_{pc}^2$). That is, the LLEOPLS model eliminates quality-independent
nuisance alarms better. However, there are still some differences in alarm rates
between PQAR and $T_{pc}^2$, for example for IDV(5), IDV(7), and IDV(20). What causes
this difference? Next, the differences between the LLEOPLS method and the other
methods are further analyzed based on the PQAR and $T_{pc}^2$ alarm rates.
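The PQAR defined in (11.31) can be computed directly from the measured quality variables. The sketch below is an illustration under assumptions: scaling uses the mean and standard deviation of normal-operation data, a faulty sample is counted when any quality variable exceeds three standard deviations in absolute value (one possible reading of the norm in (11.31)), and the names `pqar` and `fault_start` are hypothetical.

```python
import numpy as np

def pqar(Y_fault, Y_normal, fault_start=0, k=3.0):
    """Percentage of faulty samples whose scaled output leaves the +/- k sigma band."""
    mu = Y_normal.mean(axis=0)
    sd = Y_normal.std(axis=0, ddof=1)
    Z = (Y_fault[fault_start:] - mu) / sd          # scale with normal-operation statistics
    alarms = np.any(np.abs(Z) > k, axis=1)         # any quality variable out of band
    return 100.0 * alarms.mean()

rng = np.random.default_rng(6)
Y_normal = rng.standard_normal((500, 2))
mu, sd = Y_normal.mean(axis=0), Y_normal.std(axis=0, ddof=1)

# Synthetic fault record: 10 samples at the normal mean, 10 samples shifted by 5 sigma.
Y_fault = np.tile(mu, (20, 1))
Y_fault[10:, 0] += 5.0 * sd[0]
rate = pqar(Y_fault, Y_normal)
```

In the TEP data the fault is typically introduced partway through each test record, so `fault_start` would be set to the index of the first faulty sample.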


11.5.2 Fault Detection Analysis 


The differences in fault detection results are discussed for the PLS (CPLS) model and
the LLEPLS (LLEOPLS) model, respectively. In several cases the output variables
or process variables show no faults or only minor faults (IDV(3,9,15)), and both approaches
provide consistent conclusions. For other faults, there are some differences in their
diagnostic results. Two kinds of failure cases, quality-recoverable failures and
quality-irrelevant failures, are analyzed as follows. Subplots (a)-(d) of Figs. 11.4,
11.5 and 11.6 show the monitoring results based on the statistics $T_{pc}^2$ and $T_e^2$, respectively.
The blue line shows the monitored value and the red dashed line shows the control
limit of 99.75%. The corresponding subplots (e) and (f) give the output prediction,
where the blue dashed line is the measured value and the green line is the predicted
value.

Fig. 11.4 PLS, LLEPLS, CPLS, and LLEOPLS monitoring result for IDV(1) and the output predicted values


Experiment 1: Quality-Recoverable Faults 


Consider the faults IDV(1), IDV(5), and IDV(7). All these fault conditions are step faults,
but the in-process feedback controller or cascade controller can compensate for the
changes in the output variables; therefore, the product quality variables under the fault
conditions IDV(1), IDV(5), and IDV(7) tend to return to normal. The monitoring
results of IDV(1) obtained by the PLS, LLEPLS, CPLS, and LLEOPLS
methods are shown in Fig. 11.4.

The $T_e^2$ statistics in the CPLS and LLEOPLS methods detect
the process-related faults. The $T_{pc}^2$ statistic of the LLEOPLS model returns below
the control limit, which indicates that those faults are quality-recoverable. Existing
work in the literature reports high detection rates for these faults. For example,
the PLS, CPLS, and LLEPLS methods give many false alarms based on $T^2$ for IDV(1).
In this case, the LLEOPLS method accurately reflects the changes in both process
variables and quality variables.
Fig. 11.5 PLS and LLEPLS monitoring result for IDV(17) and the output predicted values

Fig. 11.6 PLS and LLEPLS monitoring result for IDV(20) and the output predicted values

For IDV(1), a large difference between FDR($T^2$) and PQAR can be observed.
FDR ($T^2$ or $T_{pc}^2$) is based on the principal components of the process
variables (without time delay), while PQAR is obtained from the actual output
values (with time delay), so the two are not equivalent. Moreover, the
data used for modeling are collected under normal operation, not under fault conditions,
so the nonlinearity may not be fully excited (i.e., the nonlinearities appear
to be linear in normal and steady operation). When a fault occurs, the nonlinearity is
fully excited and may lead to false alarms and missed alarms because
the original model cannot predict the output. In fact, using $T^2$ to monitor
quality-related faults implies the assumption that the output of the system can
still be well predicted by the model in case of a failure. Although the variation of the
predicted value of the PLS model (XMEAS(38)) follows the variation of the actual
output value, the predicted value is too large, which results in a much larger FDR ($T^2$ in
the PLS, CPLS, and LLEPLS models) than the PQAR. Nevertheless, the monitoring
results of CPLS and LLEPLS are closer to reality owing to the orthogonalization strategy
and the local linear embedding strategy, respectively.


Experiment 2: Quality-Irrelevant Faults 


Faults IDV(4,11,14,17,19,20) are quality-irrelevant, among which IDV(4), IDV(11),
IDV(14), and IDV(17) are considered quality-independent but process-related.
The monitoring results and output predictions for IDV(17) are shown in Fig. 11.5.
As shown in Fig. 11.5e, f, the PLS model cannot predict the output values well, while
the LLEPLS model predicts the output values very accurately. As a result, many false
alarms are generated by $T^2$ of the PLS method. There are two possible reasons: the PLS
model does not map the nonlinear functions well, and its principal components
contain variations orthogonal to the output variables. Although CPLS improves the
orthogonal part of PLS, its ability to extract nonlinearity is still poor. In contrast,
the LLEPLS model captures the nonlinear structure well and filters out these false
alarms by LLE.

IDV(20) is another touchstone for fault detection. The monitoring results and the
output predictions are shown in Fig. 11.6. Judged by PQAR, none of the methods detects this fault well,
but the LLEOPLS method performs best. The predicted
results show that the LLEPLS model can predict the output variation well. Since
the orthogonal component has been removed, the question remains why $T_{pc}^2$ still fails to yield
consistent results. One underlying reason is that the nonlinear dynamics excited
by IDV(20) cannot be well described by the parameters $[k_x, k_y] = [24, 20]$, which
in turn leads to a wrong classification. Another reason could be the different
control limits of PQAR and $T_{pc}^2$. The statistical results of PQAR are obtained by
assuming that the output variables obey a Gaussian distribution, and
their control limits are determined by a threefold standard deviation criterion. However,
the 99.75% control limit of $T_{pc}^2$ was obtained by non-parametric estimation,
which differs from the result of the Gaussian assumption (the control limit of $T_{pc}^2$
with confidence level 99.75% is 9.9583 for the non-parametric KDE, but 12.0708 under the
Gaussian assumption). In fact, the monitoring results of $T_{pc}^2$ of LLEOPLS
show that most of the alarms are transient alarms and few are continuous, where the
transient alarms may be caused by noise.

Fig. 11.7 PQAR and the corresponding LLEOPLS monitoring results


Experiment 3: Other Quality-Related Faults 


For most other quality-related faults, the FDR results given in Table 11.1 are essentially the same for these
methods. However, the FDR results are significantly different
for IDV(2), IDV(8), IDV(21), etc. The superiority of the proposed method is further
verified by comparing with the PQAR of IDV(2), IDV(8), and IDV(21). The monitoring results
are shown in Fig. 11.7. Although faults IDV(2) and IDV(8) are quality-related, the
quality still meets the production requirements under these fault conditions,
so the quality-related alarm rate is not high. The monitoring results of the proposed
LLEOPLS method are consistent with PQAR.


11.6 Conclusions 


Nonlinear regression modeling and analysis is a particularly difficult task. The LLEPLS
model transforms the nonlinear regression problem into a combination of multiple
local linear regression problems using the local linear embedding feature. It not only
preserves the local properties of the original data, but also maximizes the
correlation between the input space and the output space, and thus
predicts the quality variables more accurately. However, the $T^2$ statistic of the LLEPLS model
still contains variation orthogonal to the output. In order to eliminate it, the input
space of LLEPLS is further orthogonally decomposed, and the corresponding
statistical criteria are established, i.e., LLEOPLS is obtained. The characteristics of the
LLEOPLS model with nonlinear mapping and orthogonal decomposition are further
clarified by comparison with the PLS, CPLS, and LLEPLS models on the TEP benchmark
simulation. Simulation results show that the LLEOPLS model is more effective for
nonlinear systems and yields better (more consistent) fault detection performance
than the PLS, CPLS, and LLEPLS models. Although LLEOPLS has good
quality-related monitoring performance for nonlinear processes, it has some
limitations, such as the assumptions that the low-dimensional manifold on which the sampled data are
located is linear and that the noise follows a Gaussian distribution. These are the
directions of our further research.
Chapter 12
New Robust Projection to Latent Structure


In many actual nonlinear systems, especially near the equilibrium point, linearity is
the primary feature and nonlinearity is the secondary feature. For a system that
deviates from the equilibrium point, the secondary nonlinearity or local structure
feature can also be regarded as a small uncertainty, just as nonlinearity can
be used to represent the uncertainty of a system (Wang et al. 2019). So this chapter
also focuses on how to deal with nonlinearity in the PLS family of methods, but starts
from a different view, i.e., robust PLS. Here the system nonlinearity is considered
as uncertainty, and a new robust $L_1$-PLS is proposed.

The traditional PLS and its nonlinear extensions usually
maximize the covariance between the input and output data, i.e., the square of the $L_2$ norm.
The $L_2$ norm has a clear physical meaning and is convenient to calculate, and its
solution is unique, unbiased, and dense. However, it is ineffective for systems with rich
local features, such as nonlinear systems or uncertain systems. The proposed robust
$L_1$-PLS aims at robustness of both the feature extraction and the regression
coefficients. This method maintains the relative size of the signal during feature extraction.
Moreover, it guarantees that the features are robust to outliers in the global statistical
view and sensitive to the local structure information.


12.1 Motivation of Robust L1-PLS


Many robust PLS methods have been developed recently to increase the robustness of the
traditional PLS method. Branden (2004) and Hubert (2008) replaced the empirical
variance-covariance matrix in PLS by a robust covariance estimator, and used the
minimum covariance determinant (MCD) estimator and the reweighted MCD estimator
(RMCD) for low-dimensional data sets. Turkmen (2013) proposed the influence
function analysis for the robust PLS estimator. Currently, the existing robust PLS
methods use robust covariance estimation techniques together with the identification of
multivariate outliers to maintain robustness (Fortuna et al. 2007; Filzmoser 2016). These
methods implicitly assume that the signal follows a
Gaussian distribution, which is not satisfied for many industrial processes. Industrial data
usually contain many outliers and follow either a heavy-tailed distribution
(Domański 2019) or a multipeak distribution (Wang 2000). In other words, the
statistical properties of this kind of data cannot be described by robust covariance
matrix estimation. Furthermore, outliers may contain very important information, so
they cannot simply be deleted or replaced (Liu et al. 2018). Besides the outliers, the data also have
some nondominant local structure features. Robust covariance
estimation methods do not handle this small uncertainty correctly either.

Recently, a robust PCA (RPCA) (Kwak 2008) and a robust sparse PCA (RSPCA)
(Meng et al. 2012) were proposed. Both methods maximize the $L_1$ norm
rather than the square of the $L_2$ norm of the input data. Experiments showed that they
are efficient and robust for data with inherent uncertainty and outliers. However,
these two improved PCA methods do not use any information from the
output quality variables, so it is difficult to apply them directly to quality-relevant
process monitoring and fault diagnosis (Zhou et al. 2018). The monitoring system
will raise an alarm whenever a fault is detected, whether it affects the product quality
or not, and many such alarms are irrelevant to the final product quality.

It is known that least absolute deviation (LAD) regression is often better than
least squares (LS) regression for non-Gaussian signals, especially those with a
heavy-tailed distribution, and that LAD regression is robust to outliers. However,
the solution of LAD regression is not unique, and an
optimization technique is needed to obtain an optimal solution. LAD regression of a
high-dimensional system is therefore a time-consuming task. To improve the efficiency of the LAD
algorithm, the idea of partial least squares (PLS) regression is used to extend the
conventional LAD regression to partial LAD regression. The PLS-based monitoring
method decomposes the process space through the correlation between the quality
and the process variables, which can reflect the quality-relevant changes in
the process variables (Wang et al. 2017; Zhou et al. 2018).

In order to enhance the robustness of the PLS method in a new way, this chapter
proposes a novel dual-robustness projection to latent structure regression method
based on the $L_1$ norm, $L_1$-PLS. The optimization objective for principal
component extraction in the PLS method is the square of the $L_2$ norm, i.e., a least
squares regression problem. $L_1$-PLS uses $L_1$ norm maximization to replace the
squared $L_2$ norm maximization of the traditional PLS methods. $L_1$ norm
penalty terms are added to the direction vectors in the latent structure construction.
Moreover, partial LAD regression is used to obtain the regression coefficients.
Therefore, the $L_1$-PLS regression method achieves dual robustness, covering both
the principal components and the regression coefficients. In addition, the $L_1$
norm optimization target retains local structural features to some extent,
compared with the $L_2$ norm optimization target.

$L_1$-PLS is distinguished from other existing robust PLS methods in several
respects:

(1) The noises, outliers, and local structure features generally enter the system
through the direction vectors, and the $L_1$ norm maintains the relative size
of the original signal; its direction vectors are robust to outliers and contain
more local structure features even without preprocessing of outliers. This
allows the $L_1$ norm to capture the global and local features of the system at
the same time without destroying the integrity of the samples;

(2) The $L_1$-PLS method, with the $L_1$ norm penalty term on the direction vectors, can
obtain sparse principal components and filter out the disturbance variables,
so that the sparse PCs are robust to disturbance variables;

(3) The regression coefficients are obtained by partial LAD regression. The
corresponding regression model is also robust to outliers or uncertainties, and
the model has better predictive performance.


12.2 Introduction to RSPCA Method 


Consider the input data $X = [x(1), \cdots, x(n)] \in \mathbb{R}^{m \times n}$, where $x = [x_1, \cdots, x_m]^{T}$; $m$ and
$n$ are the dimensionality of the input data and the number of samples, respectively. The
traditional PCA method aims to find the $d$-dimensional ($d < m$) linear subspace with
the largest input data variance. The objective function is as follows:

$$
W^{*} = \arg\max \left\| W^{T} X \right\|_2, \quad \text{s.t.}\; W^{T} W = I_d, \tag{12.1}
$$

where $W = [w_1, w_2, \ldots, w_d] \in \mathbb{R}^{m \times d}$ is the weight matrix and $\|\cdot\|_2$ represents the $L_2$ norm
of a matrix or vector.

However, the principal components obtained by PCA are usually linear
combinations of all the original variables with non-zero weights. The non-zero
weights mean that many irrelevant variables are included in the final model and
cause unnecessary interference. Therefore, the sparse PCA (SPCA) method was
proposed to make the principal components as sparse as
possible (Liu 2014). Its objective function is

$$
W^{*} = \arg\max \left\| W^{T} X \right\|_2^2, \quad \text{s.t.}\; W^{T} W = I_d,\; \| W \|_1 \le s, \tag{12.2}
$$

where $\|\cdot\|_1$ is the $L_1$ norm of a matrix or vector. It is introduced as a constraint or penalty
term to enhance the sparsity of the principal components, and $s$ is the number of non-zero
weights. The $L_1$ norm penalty term ($\| W \|_1 \le s$) realizes the sparse expression of the
direction vector.

Figure 12.1 shows how the $L_1$ norm and the $L_2$ norm amplify noise.
The blue dotted line is the square of the $L_2$ norm (for one-dimensional data, it is
equivalent to the $L_2$ norm), and the red line is the $L_1$ norm. The $L_2$ norm
suppresses the data in $|x| < 1$ and enlarges the data
in $|x| > 1$. The $L_1$ norm maintains the relative size of the original data and has a
relatively small expansion effect on all data. In order to further improve the robustness
of SPCA, the RSPCA method was proposed to reduce the sensitivity of the principal
components to outliers. The $L_2$ norm in the objective function is replaced by the $L_1$
norm (Zou et al. 2006). The optimization function of RSPCA is given as follows:

$$
w^{*} = \arg\max \left\| X^{T} w \right\|_1, \quad \text{s.t.}\; w^{T} w = 1,\; \| w \|_1 \le s. \tag{12.3}
$$

Fig. 12.1 The expanding effects of the $L_1$ norm and $L_2$ norm curve

Here the optimization problem combines $L_1$ norm maximization with an $L_1$
norm penalty term. In order to obtain the principal components of
the RSPCA method, the optimal direction vector $w^{*}$ is calculated by Algorithm 3.

The convergence of Algorithm 3 and the rationality of the obtained sparse
direction vectors have been theoretically verified (Zou et al. 2006). However, Algorithm
3 requires the sparsity of the data to be given in advance when
calculating the sparse direction vector. Generally speaking, the sparsity of the input data
is unknown and uncertain. More importantly, the RSPCA method
cannot be directly applied to quality-related process monitoring. Therefore, this chapter
introduces the $L_1$ norm into the PLS method.
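The main loop of Algorithm 3 (sign assignment, soft thresholding, and normalization) can be sketched in Python. This is a simplified illustration rather than a full implementation: it omits the perturbation step that handles samples with exactly zero projection, it assumes $s < m$ and no ties among the largest entries of $|v|$, and the function name `rspca_direction`, the random initialization, and the iteration cap are choices made for the example.

```python
import numpy as np

def rspca_direction(X, s, max_iter=200, seed=0):
    """One sparse L1-norm PC direction. X is m x n (columns are samples), s < m."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    w = rng.standard_normal(m)
    w /= np.linalg.norm(w)                          # unit-norm initialization
    for _ in range(max_iter):
        p = np.where(w @ X >= 0, 1.0, -1.0)         # polarity of each sample
        v = X @ p                                   # v = sum_i p_i X_i
        gamma = np.sort(np.abs(v))[::-1][s]         # (s+1)-th largest entry of |v|
        beta = np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)   # soft threshold
        norm = np.linalg.norm(beta)
        if norm == 0.0:                             # degenerate case (ties); stop early
            break
        w_new = beta / norm
        if np.allclose(w_new, w):                   # converged
            break
        w = w_new
    return w

rng = np.random.default_rng(7)
X = rng.standard_normal((6, 200))
X[0] *= 5.0                                         # one high-variance variable
w_star = rspca_direction(X, s=2)
l1_objective = np.abs(X.T @ w_star).sum()           # value of ||X^T w||_1
```

Soft thresholding at the $(s+1)$-th largest magnitude leaves at most $s$ non-zero weights, which is how the sparsity level enters the update.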


Algorithm 3 RSPCA algorithm for one sparse PC

Input:
  Data matrix $X$, sparsity $s$.
Output:
  The $s$-sparse PC $w^*$.
1: Initialization: choose $w(0) \in \mathbb{R}^{m}$, set $w(0) = w(0)/\|w(0)\|_2$ and $k = 0$.
2: Let $v = (v_1, \ldots, v_m)^T = \sum_i p_i(k) X_i$, where $p_i(k) = \begin{cases} 1, & w^T(k) X_i \ge 0 \\ -1, & w^T(k) X_i < 0 \end{cases}$ and $X_i$ is the $i$th column of the matrix $X$. Let $\gamma$ be the $(s+1)$th largest element in $|v|$.
3: Let $\beta = (\beta_1, \ldots, \beta_m)^T$, where $\beta_i = \mathrm{sgn}(v_i)(|v_i| - \gamma)_+$, $i = 1, \ldots, m$, with $(z)_+ = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases}$ and $\mathrm{sgn}(z) = \begin{cases} 1, & z > 0 \\ 0, & z = 0 \\ -1, & z < 0 \end{cases}$. Make $w(k+1) = \beta / \|\beta\|_2$ and $k = k + 1$.
4: If $w(k) \ne w(k-1)$, return to Step 2; otherwise continue to Step 5.
5: If there is $i$ such that $w^T(k) X_i = 0$ and $\mathrm{sgn}\left(\sum_j |w(k)_j X_{ji}|\right) \ne 0$, then let $w(k) = (w(k) + \Delta w)/\|w(k) + \Delta w\|_2$ and return to Step 2; otherwise continue to Step 6. Here $\Delta w$ is a small non-zero random vector.
6: Set $w^* = w(k)$ and stop iteration.
7: return $w^*$


12.3 Basic Principle of $L_1$-PLS


The double robust projection to latent structure ($L_1$-PLS) method is built on the
$L_1$ norm and aims at improving the robustness of the traditional PLS method. The PLS
method extracts principal components from the input space and the output space, and the
principal components should satisfy the following conditions: they carry as much of the
variation information of their respective variable spaces as possible (representation),
and the degree of correlation between the different variable spaces is as large
as possible (correlation). Take the extraction of the first principal component as an
example. The PLS method is expressed as follows:


$$\begin{aligned} E_0^T F_0 F_0^T E_0 w_1 &= \theta_1^2 w_1 \\ F_0^T E_0 E_0^T F_0 c_1 &= \theta_1^2 c_1, \end{aligned} \tag{12.4}$$

where $w_1$ and $c_1$ are the direction vectors of the principal components $t_1$ and $u_1$. The
optimization problem (12.4) is transformed into finding the unit direction vectors
$w_1$ and $c_1$ corresponding to the maximum eigenvalue $\theta_1^2$ of the matrices $E_0^T F_0 F_0^T E_0$
and $F_0^T E_0 E_0^T F_0$, respectively. It can be seen that the solution of (12.4) satisfies the
requirements on representation and correlation in the PLS method.

Then, multiply both sides of the equations in (12.4) by $w_1^T$ and $c_1^T$, respectively, to obtain

$$\begin{aligned} w_1^T E_0^T F_0 F_0^T E_0 w_1 &= \theta_1^2, \quad \text{s.t. } w_1^T w_1 = 1 \\ c_1^T F_0^T E_0 E_0^T F_0 c_1 &= \theta_1^2, \quad \text{s.t. } c_1^T c_1 = 1. \end{aligned} \tag{12.5}$$

To simplify further, we get

$$\begin{aligned} w_1^* &= \arg\max \|w_1^T E_0^T F_0\|_2^2, \quad \text{s.t. } w_1^T w_1 = 1 \\ c_1^* &= \arg\max \|c_1^T F_0^T E_0\|_2^2, \quad \text{s.t. } c_1^T c_1 = 1. \end{aligned} \tag{12.6}$$

The optimization problem of the traditional PLS (12.4) is thus expressed as an $L_2$ norm
optimization in (12.6), where $w_1^*$ and $c_1^*$ are the optimal direction vectors.
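The equivalence between the eigenvalue problem (12.4) and the $L_2$ norm maximization (12.6) can be checked numerically. The following NumPy sketch uses synthetic matrices as stand-ins for $E_0$ and $F_0$ (their sizes and values are assumptions made for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
E0 = rng.standard_normal((50, 4))                      # hypothetical normalized inputs
F0 = E0[:, :2] @ np.array([[1.0, 0.5], [0.2, 1.0]]) \
     + 0.1 * rng.standard_normal((50, 2))              # hypothetical normalized outputs

M = E0.T @ F0                                          # cross-product matrix E0^T F0

# (12.4): w1 is the unit eigenvector of E0^T F0 F0^T E0 for the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(M @ M.T)             # eigh sorts eigenvalues in ascending order
theta_sq = eigvals[-1]
w1 = eigvecs[:, -1]

# (12.6): the objective ||w1^T E0^T F0||_2^2 evaluated at w1 equals theta_1^2.
objective = np.linalg.norm(w1 @ M) ** 2

# c1 is the dominant eigenvector of F0^T E0 E0^T F0, with the same largest eigenvalue.
c_vals, c_vecs = np.linalg.eigh(M.T @ M)
c1 = c_vecs[:, -1]

print(theta_sq, objective, c_vals[-1])
```

The three printed numbers agree up to rounding, and any other unit vector gives an objective value no larger than $\theta_1^2$.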
It is known that, in most cases, noise flows into the regression model through the direction
vectors ($w_1$ and $c_1$), which affects the estimation of the regression
parameters in the PLS method. Following the idea of (12.3), we replace the
maximization of the $L_2$ norm in the objective function (12.6) with the maximization
of the $L_1$ norm. Moreover, an $L_1$ norm penalty term is added to the direction vector.
Therefore, the objective function of the $L_1$-PLS method is given as follows:

$$\begin{aligned} w_1^* &= \arg\max \|w_1^T E_0^T F_0\|_1, \quad \text{s.t. } w_1^T w_1 = 1,\ \|w_1\|_1 < s_1 \\ c_1^* &= \arg\max \|c_1^T F_0^T E_0\|_1, \quad \text{s.t. } c_1^T c_1 = 1,\ \|c_1\|_1 < s_2, \end{aligned} \tag{12.7}$$

where $s_1$ and $s_2$ are the sparsity of the input space data and the output space data,
respectively.

According to the above analysis, although the direction vectors ($w_1$ and $c_1$) in
(12.4) contain the correlation between the input data $E_0$ and the output data $F_0$,
they can fortunately be solved separately in (12.7). Therefore, Algorithm 3 is also
suitable for solving (12.7), with the input data matrix $X$ replaced by
$E_0^T F_0$ and $F_0^T E_0$, respectively. Note that Algorithm 3 solves $w_1$ and $c_1$
independently rather than jointly.
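A minimal NumPy sketch of Algorithm 3 applied to (12.7) is given below. It follows the steps of Algorithm 3, with several assumptions that are not in the original: the function name `rspca_direction`, the random initialization, the iteration cap, the size of the random perturbation, and the synthetic matrices standing in for $E_0$ and $F_0$.

```python
import numpy as np

def rspca_direction(X, s, max_iter=500, seed=0):
    """One sparse unit direction for max ||X^T w||_1, following the steps of Algorithm 3.
    X holds one sample per column; s is the number of non-zero weights to keep."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    w = rng.standard_normal(m)
    w /= np.linalg.norm(w)                                  # Step 1
    for _ in range(max_iter):
        p = np.where(w @ X >= 0, 1.0, -1.0)                 # Step 2: polarity of each column
        v = X @ p
        abs_sorted = np.sort(np.abs(v))[::-1]
        gamma = abs_sorted[s] if s < m else 0.0             # (s+1)-th largest |v_i|, 0 if s >= m
        beta = np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)   # Step 3: soft threshold
        norm = np.linalg.norm(beta)
        if norm == 0.0:
            raise ValueError("all weights were thresholded to zero")
        w_new = beta / norm
        converged = np.allclose(w_new, w)                   # Step 4
        w = w_new
        if converged:
            proj = w @ X
            overlap = np.abs(w) @ np.abs(X)
            if np.any((proj == 0) & (overlap != 0)):        # Step 5: perturb and restart
                w = w + 1e-6 * rng.standard_normal(m)
                w /= np.linalg.norm(w)
                continue
            return w                                        # Step 6
    return w

# Synthetic stand-ins for the normalized data matrices E0 and F0.
rng = np.random.default_rng(1)
E0 = rng.standard_normal((100, 6))
F0 = E0[:, :2] + 0.1 * rng.standard_normal((100, 2))

w1 = rspca_direction(E0.T @ F0, s=3)    # X replaced by E0^T F0
c1 = rspca_direction(F0.T @ E0, s=2)    # X replaced by F0^T E0
print(np.round(w1, 3), np.round(c1, 3))
```

The two calls are independent of each other, which mirrors the remark that $w_1$ and $c_1$ are not solved jointly. In the second call the sparsity equals the number of output variables, so no weight is thresholded away.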

Once the optimal direction vectors $w_1$ and $c_1$ are obtained, the score vectors in
the latent space, i.e., the first principal component pair $t_1$ and $u_1$, can be calculated:

$$t_1 = E_0 w_1, \quad u_1 = F_0 c_1. \tag{12.8}$$


Next, the regression coefficients (loading vectors) of $F_0$ and $E_0$ on $t_1$ are
established. In the traditional PLS model, the regression coefficients $p_1$ and $q_1$ are
estimated by least squares, namely,

$$\begin{aligned} p_1 &= E_0^T t_1 / \|t_1\|^2 \\ q_1 &= F_0^T t_1 / \|t_1\|^2. \end{aligned} \tag{12.9}$$

Least squares estimation is also susceptible to outliers, and the least
absolute deviation (LAD) method is introduced to deal with this problem. Therefore,
in order to further improve the robustness, LAD regression is used to solve the
regression coefficients in the $L_1$-PLS algorithm, namely,

$$\begin{aligned} p_1^* &= \arg\min \|E_0 - t_1 p_1^T\|_1 \\ q_1^* &= \arg\min \|F_0 - t_1 q_1^T\|_1, \end{aligned} \tag{12.10}$$

where $p_1^*$ and $q_1^*$ are the optimal loading vectors of (12.10).
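The difference between the least-squares loadings (12.9) and the LAD loadings (12.10) can be seen on a single variable. The sketch below assumes that the $L_1$ norm in (12.10) is taken entrywise, so that the problem separates into one scalar problem per variable, each solved by a weighted median of the ratios between the data and the scores. The helper names and the toy data are illustrative assumptions.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]

def lad_loading(column, t):
    """argmin over p of sum_i |column[i] - t[i] * p|, i.e. one entry of p_1 in (12.10)."""
    ratios = [c / ti for c, ti in zip(column, t) if ti != 0]
    weights = [abs(ti) for ti in t if ti != 0]
    return weighted_median(ratios, weights)

def ls_loading(column, t):
    """Least-squares counterpart, i.e. one entry of p_1 in (12.9)."""
    return sum(c * ti for c, ti in zip(column, t)) / sum(ti * ti for ti in t)

t = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical score vector
clean = [2.0 * ti for ti in t]         # data generated with a true loading of 2
dirty = clean[:]
dirty[4] = 100.0                       # a single outlier replaces the last observation

print(ls_loading(dirty, t), lad_loading(dirty, t))
```

With the outlier present, the least-squares loading moves far away from 2, while the LAD loading stays at 2.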

Obviously, (12.10) is in essence also an $L_1$ norm problem. When there are few
outliers, it is not necessary to use the $L_1$ norm to solve the regression coefficients.
Because the direction vector has already been solved by maximizing the $L_1$ norm, the
influence of the outliers has been reduced. Moreover, as can be seen from Fig. 12.1, when
the outlier is small, the $L_2$ norm and the $L_1$ norm have the same effect.

Calculate the residual matrices $E_1$ and $F_1$:

$$E_1 = E_0 - t_1 p_1^T, \quad F_1 = F_0 - t_1 q_1^T. \tag{12.11}$$


Similar to the extraction of the first principal component pair, the other principal
components are calculated iteratively by decomposing the residuals $E_i$ and
$F_i$ ($i = 1, \ldots, d-1$). The extraction of principal components continues until the
model determined by the extracted principal components satisfies the desired
requirements.
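One extraction step, from the score vector (12.8) to the residual matrices (12.11), can be sketched as follows. The data are synthetic and the direction vector is a fixed placeholder rather than the output of Algorithm 3; both are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
E0 = rng.standard_normal((30, 5))          # hypothetical normalized input data
F0 = rng.standard_normal((30, 2))          # hypothetical normalized output data
w1 = np.ones(5) / np.sqrt(5.0)             # placeholder unit direction vector

t1 = E0 @ w1                               # score vector, (12.8)
p1 = E0.T @ t1 / (t1 @ t1)                 # least-squares loadings, (12.9)
q1 = F0.T @ t1 / (t1 @ t1)

E1 = E0 - np.outer(t1, p1)                 # residual matrices, (12.11)
F1 = F0 - np.outer(t1, q1)

# With least-squares loadings the residuals are orthogonal to t1.
print(np.abs(E1.T @ t1).max(), np.abs(F1.T @ t1).max())
```

Both printed values are zero up to rounding, so the next component is extracted from data that no longer contain the part explained by $t_1$. With LAD loadings from (12.10) this orthogonality is not guaranteed to hold exactly.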

The dual robustness of the $L_1$-PLS algorithm is reflected in the following two
aspects:


1. Different from the PLS algorithm, Algorithm 3 is used to calculate the direction
vector each time. By maximizing the $L_1$ norm in the objective function and
adding the $L_1$ norm penalty term to the direction vector, the robustness of
principal component extraction is achieved.

2. In the case of many outliers, the regression coefficients can be calculated using
least absolute deviation estimation, which overcomes the sensitivity of least squares
estimation to outliers and further enhances the robustness
of the $L_1$-PLS algorithm.


12.4 $L_1$-PLS-Based Process Monitoring


Only the calculation of the direction vectors $w_i$ and $c_i$ (12.7) and of the
regression coefficients $p_i$ and $q_i$ (12.10) is changed in the $L_1$-PLS method,
and the other steps are not affected. Therefore, the monitoring process based on the
$L_1$-PLS method is the same as that of the PLS method. In process monitoring based on the
$L_1$-PLS method, the $T^2$ and $T_e^2$ statistics are still used to monitor the principal
component subspace and the remaining subspace, respectively. The $L_1$-PLS-based monitoring is
described in detail in Algorithm 4 (offline process training) and Algorithm 5 (online
process monitoring). The corresponding flowchart is shown in Fig. 12.2.

In Algorithms 4 and 5, $\Lambda$ and $\Lambda_e$ represent the sample covariance matrices. The
non-parametric kernel density estimation (KDE) method (1.33) is used to estimate
the corresponding control limits of $T^2$ and $T_e^2$.

Algorithm 4 $L_1$-PLS method for offline process training

Input:
  Normal data sets $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$, $Y = [y_1, \ldots, y_l] \in \mathbb{R}^{n \times l}$, sparsity $s_1$ and $s_2$.
Output:
  The control limits $T^2_{\lim}$ and $T^2_{e,\lim}$.
(1) Normalize $X$ and $Y$ as $E_0$ and $F_0$.
(2) For $i = 1, \ldots, d$ ($d$ is obtained by cross-validation):
  (2.1) Apply Algorithm 3 to the projected matrices $E_{i-1}^T F_{i-1}$ and $F_{i-1}^T E_{i-1}$
        to get the direction vectors $w_i$ and $c_i$, respectively.
  (2.2) Calculate the score vectors: $t_i = E_{i-1} w_i$, $u_i = F_{i-1} c_i$.
  (2.3) Calculate the load vectors:
        $p_i = E_{i-1}^T t_i / \|t_i\|^2$ or $p_i^* = \arg\min \|E_{i-1} - t_i p_i^T\|_1$,
        $q_i = F_{i-1}^T t_i / \|t_i\|^2$ or $q_i^* = \arg\min \|F_{i-1} - t_i q_i^T\|_1$.
  (2.4) Calculate the residual matrices: $E_i = E_{i-1} - t_i p_i^T$, $F_i = F_{i-1} - t_i q_i^T$.
(3) Describe $t_i$ with the original matrix $E_0$: $T = E_0 R$,
    $R = [r_1, \ldots, r_d]$, in which $r_i = \prod_{j=1}^{i-1} (I_m - w_j p_j^T) w_i$,
    $\hat{E} = T P^T = E_0 R P^T$,
    $\tilde{E} = E_0 - \hat{E} = E_0 (I_m - R P^T)$.
(4) For a normalized data sample $x$, calculate its estimate, residual and the corresponding PC
    value.
(5) Calculate the statistics $T^2$ and $T_e^2$:
    $T^2 = t \Lambda^{-1} t^T = t \left\{ \frac{1}{n-1} T^T T \right\}^{-1} t^T$,
    $T_e^2 = e \Lambda_e^{-1} e^T = e \left\{ \frac{1}{n-1} \tilde{E}^T \tilde{E} \right\}^{-1} e^T$.
return $T^2_{\lim}$ and $T^2_{e,\lim}$


There is still a key problem in the implementation of Algorithm 4: the sparsity
degrees $s_1$ and $s_2$ need to be given a priori. There are two common strategies to
determine $s_1$ and $s_2$. (1) The first one is the variable importance in prediction (VIP) method
(Farrés et al. 2015). It judges whether a variable is irrelevant based on
the VIP score of the $j$th predicted value of the response variable. Usually, the "greater
than $\varepsilon$" criterion is used as the selection criterion. More precisely, the threshold $\varepsilon$
should be adjusted based on the distribution of the overall data in different situations.
(2) The second strategy is the selectivity ratio method (Branden and Hubert 2004).
The variable selectivity ratio is calculated as the ratio of the variance of
the $X$ variable explained on the $Y$ target projection component to the residual variance. Then an F
test is performed to define the boundary between important variables and irrelevant
variables. Since the VIP method is simple and easy to implement, it
is selected to determine the sparsity $s_1$ and $s_2$ here.
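A possible way to turn VIP scores into a sparsity value is sketched below. The VIP formula used here is the commonly used definition for PLS models, which is an assumption because the text does not state the formula; the fitted quantities `W`, `T` and `Q`, the function name `vip_scores` and the threshold value of 1 are likewise illustrative assumptions.

```python
import numpy as np

def vip_scores(W, T, Q):
    """Commonly used VIP definition (assumed here).
    W: (m, d) direction vectors, T: (n, d) scores, Q: (l, d) output loadings."""
    m = W.shape[0]
    ss = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)   # output variation explained per component
    w_unit_sq = (W / np.linalg.norm(W, axis=0)) ** 2        # squared weights of unit-length directions
    return np.sqrt(m * (w_unit_sq @ ss) / ss.sum())

# Hypothetical fitted quantities of a two-component model with six input variables.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 2))
T = rng.standard_normal((40, 2))
Q = rng.standard_normal((2, 2))

vip = vip_scores(W, T, Q)
threshold = 1.0                          # assumed threshold; to be tuned to the data
s1 = int(np.sum(vip > threshold))        # number of input variables treated as relevant
print(np.round(vip, 3), s1)
```

With this definition the squared VIP scores sum to the number of variables, so a threshold of 1 separates variables with above-average importance from the rest.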
Algorithm 5 $L_1$-PLS method for online process monitoring

Input:
  New normalized data sets $X_{new}$ and $Y_{new}$.
Output:
  Online process monitoring results.
(1) Calculate the new score vectors: $t_{new} = x_{new} R$.
(2) Calculate the new prediction and the new residual:
    $\hat{x}_{new} = t_{new} P^T = x_{new} R P^T$,
    $e_{new} = x_{new} - \hat{x}_{new} = x_{new} (I_m - R P^T)$.
(3) Calculate the new statistics $T^2_{new}$ and $T^2_{e,new}$:
    $T^2_{new} = t_{new} \Lambda^{-1} t_{new}^T = t_{new} \left\{ \frac{1}{n-1} T^T T \right\}^{-1} t_{new}^T$,
    $T^2_{e,new} = e_{new} \Lambda_e^{-1} e_{new}^T = e_{new} \left\{ \frac{1}{n-1} \tilde{E}^T \tilde{E} \right\}^{-1} e_{new}^T$.
(4) Compare $T^2_{new}$ and $T^2_{e,new}$ with the corresponding control limits $T^2_{\lim}$ and
    $T^2_{e,\lim}$.
return Online process monitoring results.
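The statistics and control limits used in Algorithms 4 and 5 can be sketched in NumPy as follows. Several parts are assumptions made for illustration: the matrices `R` and `P` are placeholders taken from a singular value decomposition instead of a fitted $L_1$-PLS model, a pseudo-inverse is used because the residual covariance is rank deficient, and the KDE control limit uses a Gaussian kernel with a rule-of-thumb bandwidth, which need not match the KDE method (1.33) referred to in the text.

```python
import math
import numpy as np

def t2_statistic(scores, reference):
    """Row-wise s * Lambda^{-1} * s^T, with Lambda = reference^T reference / (n - 1)."""
    n = reference.shape[0]
    cov = reference.T @ reference / (n - 1)
    cov_inv = np.linalg.pinv(cov, rcond=1e-10)   # pseudo-inverse (assumption, see lead-in)
    return np.einsum("ij,jk,ik->i", scores, cov_inv, scores)

def kde_limit(samples, confidence=0.9975):
    """Quantile of a Gaussian kernel density estimate, found by bisection on its CDF."""
    samples = np.asarray(samples, dtype=float)
    h = 1.06 * samples.std(ddof=1) * samples.size ** (-0.2)   # rule-of-thumb bandwidth (assumed)
    def cdf(x):
        return np.mean([0.5 * (1.0 + math.erf((x - s) / (h * math.sqrt(2.0)))) for s in samples])
    lo, hi = samples.min() - 10.0 * h, samples.max() + 10.0 * h
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cdf(mid) < confidence else (lo, mid)
    return 0.5 * (lo + hi)

# Offline part: synthetic training data and placeholder model matrices.
rng = np.random.default_rng(0)
E0 = rng.standard_normal((200, 5))
_, _, Vt = np.linalg.svd(E0, full_matrices=False)
R = P = Vt[:2].T                                   # placeholder for the L1-PLS matrices R and P

T = E0 @ R                                         # training scores
E_res = E0 @ (np.eye(5) - R @ P.T)                 # training residuals
T2_lim = kde_limit(t2_statistic(T, T))
Te2_lim = kde_limit(t2_statistic(E_res, E_res))

# Online part: one hypothetical abnormal sample with a large score on the first component.
x_new = 10.0 * R[:, :1].T
t_new = x_new @ R
e_new = x_new @ (np.eye(5) - R @ P.T)
alarm = (t2_statistic(t_new, T)[0] > T2_lim) or (t2_statistic(e_new, E_res)[0] > Te2_lim)
print(T2_lim, Te2_lim, alarm)
```

The sample is flagged because its $T^2$ value exceeds the limit estimated from the training scores, while only a small fraction of the training samples lies above that limit.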


It is worth noting that the role of sparsity is to achieve variable selection. If the
established system model contains many irrelevant variables, specifying the sparsity
helps to limit the number of irrelevant variables, so as to realize $L_1$-sparse-PLS ($L_1$-SPLS).
However, if the sparsity of the input data is uncertain, the sparsity degrees $s_1$ and $s_2$
can be set equal to the number of variables in the input and output space, respectively,
to eliminate the uncertainty caused by the sparsification. In this view, the proposed
$L_1$-PLS method is uniformly called the $L_1$-(S)PLS method, depending on the
sparsity.


12.5 TE Simulation Analysis 


In this simulation, the input variable $X$ is composed of 31 variables: XMEAS(1:22)
and XMV(1:11) (except XMV(5) and XMV(9)). The output variable $Y$ consists
of the quality components G (XMEAS(35)) and H (XMEAS(36)). Two simulation
examples are used to verify the effectiveness of the $L_1$-PLS method for fault detection.

Fig. 12.2 Flowchart of Algorithms 4 and 5


12.5.1 Robustness of Principal Components 


The robustness of the $L_1$-PLS method is mainly implemented through the direction
vectors, which directly reflects the robustness of the PCs. The variation of the PC
structure caused by outliers is therefore the focus of the robustness analysis. The results
of the PLS and RPLS methods are given for comparison. The input and output training data
($X \in \mathbb{R}^{960 \times 31}$, $Y \in \mathbb{R}^{960 \times 2}$) are sampled from the TE process under
normal operation. To further test the proposed $L_1$-PLS method, outliers
are added to the input space in the following form:


$$X(k) = X^*(k) + \Xi_j(k), \tag{12.12}$$

where $X^*(k)$ is the $k$th normal sample ($k = 1, 2, \ldots, 960$) and $\Xi_j$ is the $j$th randomly
generated outlier, which obeys the Gaussian distribution $\Xi_j \sim N(0, 2000)$. For ease of
verification, three repeatable outliers, generated using a specific random
seed, are added to the training set:

$$\begin{aligned} \Xi_1(12) &= [-71.294,\ 4.929,\ 35.199,\ -0.100]^T && \text{for } x_{14:17} \\ \Xi_2(140) &= [4.164,\ -16.912,\ -66.307]^T && \text{for } x_{29:31} \\ \Xi_3(200) &= [-1.960,\ 42.969,\ 77.737,\ -19.239,\ -72.776,\ 7.439]^T && \text{for } x_{1:6}. \end{aligned}$$

Outlier $\Xi_1(12)$ means that only the 14th, 15th, 16th, and 17th variables at the 12th
sample time $X(12)$ are abnormal, and the other variables at other sample times are
still normal. The other two outliers have similar meanings.

Fig. 12.3 The relative change rates of $t_1$ using PLS and $L_1$-PLS
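The way such outliers enter the training matrix according to (12.12) can be sketched as follows for the first two outliers. The zero matrix is only a stand-in for the normalized TE training data (an assumption), and the 1-based sample and variable indices of the text are converted to 0-based array indices.

```python
import numpy as np

n_samples, n_vars = 960, 31
X_clean = np.zeros((n_samples, n_vars))      # stand-in for the normal training data X*

# (sample time k, first affected variable, outlier values), 1-based as in the text
outliers = [
    (12, 14, [-71.294, 4.929, 35.199, -0.100]),
    (140, 29, [4.164, -16.912, -66.307]),
]

X = X_clean.copy()
for k, first_var, values in outliers:
    cols = slice(first_var - 1, first_var - 1 + len(values))
    X[k - 1, cols] += values                 # X(k) = X*(k) + outlier, on the affected variables only

changed_rows = np.flatnonzero(np.abs(X - X_clean).sum(axis=1)) + 1
print(changed_rows)
```

Only the sample times 12 and 140 differ from the clean data, and only in the listed variables.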

The sparsity $s_1$ and $s_2$ in the $L_1$-PLS method are set to 31 and 2, i.e., equal
to the number of variables in the input and output space, respectively. In other words, the
$L_1$-PLS method can reflect the changes in all variables. The numbers of components $d$ are
determined using cross-validation. They are 6, 6, and 2 for the PLS, RPLS, and $L_1$-PLS
methods, respectively. The principal components are $t_i = \sum_j w_{ij} x_j$, $i = 1, \ldots, d$,
in which $w_{ij}$ is the $j$th element of $r_i$. The coefficients $w_{ij}$ are used to reflect whether
the outliers affect the principal components. The relative rates of change (RRC)
indices are defined as follows:

$$\begin{aligned} \mathrm{RRC}_1 &= \max_j \{ |w_{ij,\mathrm{normal}} - w_{ij,\mathrm{outliers}}| \} \\ \mathrm{RRC}_2 &= \| W_{i,\mathrm{normal}} - W_{i,\mathrm{outliers}} \|_1, \end{aligned} \tag{12.13}$$

where $W_{i,\mathrm{normal}} = [w_{ij}]_{\mathrm{normal}}$ and $W_{i,\mathrm{outliers}} = [w_{ij}]_{\mathrm{outliers}}$ are the normalized
coefficient vectors obtained with the normal samples and with the outlier-contaminated samples
for the $i$th PC, respectively.

$\mathrm{RRC}_1$ represents the maximum absolute deviation of the two coefficient sets,
which indicates the worst change of the normalized $w_{ij}$. $\mathrm{RRC}_2$ represents the sum
of the absolute deviations of the two coefficient sets, which indicates the overall
change of the normalized $w_{ij}$.
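The two indices of (12.13) are straightforward to compute. In the sketch below the coefficient vectors are made-up numbers used only to show the calculation, and the function name `rrc_indices` is an illustrative assumption.

```python
def rrc_indices(w_normal, w_outliers):
    """Return (RRC_1, RRC_2) for one PC: the largest and the summed absolute deviation."""
    diffs = [abs(a - b) for a, b in zip(w_normal, w_outliers)]
    return max(diffs), sum(diffs)

w_normal = [0.50, -0.30, 0.20]       # hypothetical normalized coefficients, clean training data
w_outliers = [0.45, -0.10, 0.20]     # hypothetical coefficients after adding outliers

rrc1, rrc2 = rrc_indices(w_normal, w_outliers)
print(rrc1, rrc2)
```

Here the second coefficient changes the most and determines $\mathrm{RRC}_1$, while $\mathrm{RRC}_2$ accumulates the changes of all coefficients.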

The normalized coefficient values $w_{ij}$ of the first two PCs ($t_1$ and $t_2$) of the PLS,
RPLS and $L_1$-PLS methods are shown in Figs. 12.3 and 12.4. The corresponding
indices $\mathrm{RRC}_i$, $i = 1, 2$, are given in Table 12.1 (a smaller value is better).

Fig. 12.4 The relative change rates of $t_2$ using PLS and $L_1$-PLS

Table 12.1 $\mathrm{RRC}_i$ of $t_1$ and $t_2$ of the PLS, $L_1$-PLS and $L_1$-SPLS methods

        PLS            RPLS           L1-PLS
        t_1    t_2     t_1    t_2     t_1    t_2

It can be seen from Figs. 12.3 and 12.4 and Table 12.1 that, no matter which method
is used, the outliers always affect the structure of the PCs to some extent. In
general, the outliers have a large adverse effect on the PC extraction of the PLS
method and thus result in the largest change in its PC structures. With the robust
covariance estimation method, the outliers have little effect on the PC extraction
of the RPLS method. The $L_1$-PLS method relies only on the $L_1$ norm to be insensitive to
outliers, without any outlier processing. The changes that the outliers cause in the structure
of its two PCs are nearly identical and within an acceptable range, in both
$\mathrm{RRC}_1$ and $\mathrm{RRC}_2$. The samples considered to be outliers may be a true reflection of
the system state when the data set follows a heavy-tailed distribution (Domański
2019). It is therefore more important to retain all the samples when extracting the PCs,
although the outliers have a certain influence on the direction vectors.

By further analyzing the structure of $t_1$ and $t_2$, it can easily be found that the
PCs extracted by those methods are quite different. In order to better explain the
structural differences of $t_1$ and $t_2$ in the different methods, IDV(14) is taken as an example
for in-depth analysis. The typical process variable monitoring results of IDV(14) are
given in Fig. 12.5, in which $x_9$, $x_{21}$ and $x_{30}$ have similar monitoring results. In
$t_1$ and $t_2$, the sum of the absolute weights for $x_9$, $x_{21}$ and $x_{30}$ in the PLS method
(0.062) is more than twice that in the $L_1$-PLS method (0.025).

Fig. 12.5 Typical process variable monitoring results of IDV(14)

These weight differences do not significantly affect the output prediction and the
monitoring performance in normal operation, but they are amplified
in the fault modes. For example, consider the monitoring under the fault modes
IDV(14) and IDV(17). The role of $x_{21}$ and $x_{30}$ (especially $x_{30}$) in the PLS method is
exaggerated, leading to incorrect predictions and quality-relevant monitoring results
(see Figs. 12.6 and 12.7). In contrast, the $L_1$ norm better maintains the
relative size of those variables, so the role of $x_{21}$ and $x_{30}$ in the extracted PCs
is not exaggerated. In other words, the PCs extracted by the $L_1$ norm better capture
the relationship between the input space and the output space.


12.5.2 Robustness of Prediction and Monitoring Performance 


The robustness of the principal components of the $L_1$-PLS method is discussed in
the previous section. However, the number of principal components differs among the three
methods, so that analysis reflects only one aspect of the robustness. Now, the robustness of
the prediction and monitoring performance is analyzed further; the prediction
performance in particular directly reflects the quality of the model. There are 21 types of faults in
the TE process. Fault IDV(21) is a fault in which the output drifts slowly, caused by the
constant change of the steam valve position, so it does not reflect the robustness of
the model. Therefore, the first 20 faults are analyzed in this simulation experiment. In
this simulation, the sparsity in the $L_1$-SPLS model is determined by the VIP method:
input space $s_1 = 14$, output space $s_2 = 2$.


Experiment 1: Prediction Performance Analysis 


In this experiment, the $L_1$-PLS model shows good output prediction results for the
20 fault data sets. $L_1$-PLS(outliers) and PLS(outliers) denote the two models
trained on the normal operation data with the added outliers described in
Sect. 12.5.1. To illustrate this conclusion more clearly, four faults,
IDV(7), IDV(14), IDV(17), and IDV(18), are selected to compare the prediction
performance of the PLS model and the $L_1$-PLS model. The output prediction results are
good for all fault modes, but these four faults come from four different fault types, and
the results of the $L_1$-PLS model and the PLS model are quite different for them. Figures 12.6
and 12.7 give the output prediction results for faults IDV(7), IDV(14), IDV(17),
and IDV(18). The horizontal axis represents data samples, and the vertical axis
represents output values. The blue dashed line is the actual output value, and the green
line is the predicted output value.

Fig. 12.6 Output predicted values for IDV(7), IDV(14), IDV(17), and IDV(18) using PLS(outliers)

In these prediction and monitoring diagrams, the first 160 samples are normal
data, and the last 800 samples are data under different fault modes. For the step-change
fault IDV(7), the two models give consistent output predictions.
The feedback controller or cascade controller reduces the impact of faults and
abnormal values on product quality. For the other three faults, IDV(14), IDV(17),
and IDV(18), there are some differences in the output prediction results. When
the system is under normal operation, the PLS and $L_1$-PLS models have the
same good prediction results. However, after adding outliers, the PLS method cannot
accurately predict the output (Fig. 12.6), while the $L_1$-PLS method still quickly
detects the output changes and makes correct predictions (Fig. 12.7). In particular,
for faults IDV(17) and IDV(18), the PLS method gives seriously wrong predictions.
The experiments show that the prediction performance of the $L_1$-PLS method is better
than that of PLS. Even if the data are contaminated by outliers, $L_1$-PLS can still predict the
output accurately. In other words, the $L_1$-PLS model has more robust prediction
performance.

Fig. 12.7 Output predicted values for IDV(7), IDV(14), IDV(17), and IDV(18) using $L_1$-PLS(outliers)


Experiment 2: Monitoring Performance Analysis 


The robustness of the monitoring performance is mainly verified by the accuracy of
fault detection. The detection indices are FDR and FAR (4.1), and the control limit is
calculated with the confidence level 99.75% for both the PLS and $L_1$-PLS methods. The
FAR results of the two models are basically the same, which indicates that the proposed
$L_1$-PLS method does not increase the risk of false alarms, so FAR is not analyzed in this
section. Table 12.2 lists the FDR results of the first 20 faults without adding outliers
for the PLS, $L_1$-PLS and $L_1$-SPLS models, respectively. Table 12.3
shows the FDR results of the 20 faults after adding outliers for the
PLS (outliers), $L_1$-PLS (outliers), and $L_1$-SPLS (outliers) models.
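For reference, the two detection indices can be computed as in the following sketch. It assumes the usual definitions (the percentage of alarmed samples among the faulty samples for FDR and among the normal samples for FAR), which are taken to match (4.1), and it uses made-up statistic values arranged like the test sets here, with 160 normal samples followed by 800 faulty samples.

```python
def fdr_far(statistic, limit, fault_start):
    """Return (FDR, FAR) in percent for one statistic and one control limit."""
    alarms = [value > limit for value in statistic]
    normal, faulty = alarms[:fault_start], alarms[fault_start:]
    far = 100.0 * sum(normal) / len(normal)    # alarms raised before the fault occurs
    fdr = 100.0 * sum(faulty) / len(faulty)    # alarms raised while the fault is active
    return fdr, far

# Hypothetical statistic: below the limit in normal operation, above it for 797 faulty samples.
statistic = [0.5] * 160 + [5.0] * 797 + [0.5] * 3
fdr, far = fdr_far(statistic, limit=1.0, fault_start=160)
print(fdr, far)
```

In this toy case 797 of the 800 faulty samples raise an alarm and none of the normal samples does.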

For the serious quality-related faults IDV(2), IDV(6), IDV(8), IDV(12), IDV(13), and
IDV(18), the six models give consistent results. Therefore, these faults are not
analyzed in this chapter. For the other types of faults, including
the quality-irrelevant faults, the quality-recoverable faults, and the slight quality-related
faults, the results are very different. The detailed analysis of the three situations is given
below. In the monitoring figures of this section, the blue line represents the value of the
statistic, where the upper curve is $T^2$ and the lower is $T_e^2$. The system alarms if the
blue line exceeds the red control limit.

Table 12.2 FDRs of PLS, $L_1$-PLS, and $L_1$-SPLS

        PLS                L1-PLS             L1-SPLS
IDV     T^2      T_e^2     T^2      T_e^2     T^2      T_e^2
1       99.63    99.75     60.00    99.75     31.38    99.75
2       98.50    98.25     98.25    98.38     98.25    98.38
3       1.00     1.38      0.75     1.75      0.50     1.75
4       19.13    100.00    0.88     100.00    0.00     100.00
5       22.00    100.00    18.38    100.00    17.13    100.00
6       99.25    100.00    98.38    100.00    98.13    100.00
7       100.00   100.00    68.75    100.00    31.38    100.00
8       96.00    97.88     89.00    97.88     88.50    97.88
9       0.50     1.13      0.25     1.38      0.38     1.38
10      26.38    84.25     19.13    85.38     15.63    85.38
11      26.63    76.50     1.13     77.88     0.88     77.88
12      97.50    99.88     84.00    99.88     84.00    99.88
13      94.88    95.13     82.13    95.25     82.25    95.25
14      91.50    100.00    0.38     100.00    0.00     100.00
15      1.25     2.63      1.00     3.75      0.63     3.75
16      20.13    42.75     9.00     46.13     7.00     46.13
17      77.38    96.75     10.00    97.00     1.63     97.00
18      89.38    90.13     88.75    90.13     88.75    90.13
19      0.50     34.50     0.13     37.88     0.00     37.88
20      30.50    90.50     20.75    90.38     19.00    90.38


Case 1: Quality Irrelevant Fault 


It can be found from Table 12.2 that all models give very low alarm rates for faults IDV(3),
IDV(9), IDV(15), and IDV(19). However, the alarm rates of the $L_1$-PLS and
$L_1$-SPLS models are lower, which indicates that fewer false alarms occur during
monitoring. It can also be seen from the corresponding Figs. 12.8, 12.9, 12.10,
12.11, and 12.12 that the latter two models produce far fewer alarm points. Faults
IDV(4), IDV(11), and IDV(14) are all related to the reactor cooling water and
hardly affect the quality of the output products. The PLS model gives a higher alarm
rate, which may lead to serious false alarms, while the $L_1$-PLS model effectively
avoids these alarms. In addition, the $L_1$-PLS
model eliminates most of the false alarms in the monitoring results of Figs. 12.8, 12.9, and 12.10,
and the $L_1$-SPLS model eliminates almost all of them.

Table 12.3 FDRs of PLS(outliers), $L_1$-PLS(outliers), and $L_1$-SPLS(outliers)

        PLS (outliers)     L1-PLS (outliers)  L1-SPLS (outliers)
IDV     T^2      T_e^2     T^2      T_e^2     T^2
1       99.88    99.75     28.38    99.75     36.63
2       98.63    98.25     98.00    98.25     98.25
3       3.25     0.88      0.13     1.13      0.63
4       7.63     100.00    0.25     100.00    0.00
5       24.88    27.88     14.75    28.38     16.88
6       99.75    100.00    98.38    100.00    98.25
7       100.00   100.00    59.88    100.00    29.50
8       96.50    97.75     84.50    97.88     88.00
9       0.88     0.88      0.00     1.00      0.38
10      37.50    77.63     11.00    80.50     15.25
11      16.00    73.75     0.50     74.75     0.88
12      95.88    99.25     78.50    99.25     83.63
13      95.50    95.00     80.25    95.25     82.00
14      89.75    100.00    0.00     100.00    0.00
15      4.38     0.50      0.13     0.88      0.75
16      33.88    28.38     4.50     35.13     6.25
17      76.88    96.63     6.13     96.63     1.50
18      90.00    89.88     88.00    89.88     88.63
19      1.13     28.38     0.00     30.00     0.00
20      36.50    771.13    15.63    77.75     19.75

Fig. 12.8 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(4)
Fig. 12.9 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(11)
Fig. 12.10 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(14)
Fig. 12.11 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(15)
Fig. 12.12 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(19) (panels: PLS, $L_1$-PLS, $L_1$-SPLS; horizontal axis: sample)

When outliers are added, the PLS model still gives wrong results for quality-irrelevant
faults. The specific FDR values are shown in Table 12.3. However, the
monitoring performance of the $L_1$-PLS model remains very good: for faults IDV(9), IDV(14),
and IDV(19), the detection rate is reduced to 0, which means that false alarms
are completely eliminated in these cases. Therefore, adding outliers does not
degrade the fault monitoring results of the $L_1$-(S)PLS model. It should be noted that
the monitoring performance of the $L_1$-PLS model after adding outliers (Table 12.3)
is better than under the normal conditions (Table 12.2). A possible reason is that the
outliers increase the total noise in the input data, and the $L_1$-PLS method filters out
noise more effectively during modeling. Therefore, the established model is more
accurate and the monitoring performance is improved.


Case 2: Quality-Recoverable Fault 


Faults IDV(1), IDV(5), and IDV(7) are quality-recoverable faults. The prediction
value should tend to return to normal, but the statistic should be kept at a higher
value. Figure 12.13 shows the monitoring results of the three models for fault
IDV(1). It can be seen that both the $L_1$-PLS and $L_1$-SPLS models give the
correct alarm results. In the PLS model, the value of the statistic exceeds the control
limit, so a false alarm is generated in the process monitoring. Fault IDV(5)
is also a process-related fault. It can be seen from Tables 12.2 and 12.3 that the
fault detection rates of the $L_1$-PLS and $L_1$-SPLS models are lower than that of the PLS
model, which means that the monitoring results are more accurate. Figures 12.14
and 12.16 show the monitoring diagrams of the three models for
fault IDV(5) in the normal case (without adding outliers) and with outliers added, respectively.
For fault IDV(7), the corresponding monitoring results are shown in Fig. 12.15. The
PLS model gives a completely wrong result, while the results of the other two models
are more accurate.

Fig. 12.13 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(1)
Fig. 12.14 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(5)
Fig. 12.15 PLS, $L_1$-PLS and $L_1$-SPLS monitoring results for IDV(7)

The detection result for fault IDV(1) obtained by the $L_1$-PLS (outliers) model
seems to be better than that of the $L_1$-PLS model, and the monitoring results are more
reasonable. In addition, for fault IDV(5), although the monitoring results of the
$L_1$-PLS and $L_1$-SPLS models may not be ideal, their $T_e^2$ statistics can still detect the
process-related fault in the input space, as shown in Fig. 12.14. However, the
PLS (outliers), $L_1$-PLS (outliers) and $L_1$-SPLS (outliers) models give wrong
results (Fig. 12.16).

Fig. 12.16 PLS(outliers), $L_1$-PLS(outliers) and $L_1$-SPLS(outliers) monitoring results for IDV(5)
Fig. 12.17 Typical process variable monitoring results of IDV(5)

There are two possible reasons for this phenomenon. Firstly, the outliers were
added directly without being regulated by the dynamic system, so their influence on the
extraction of the principal components cannot be determined directly. Secondly, consider the
typical process dynamics corresponding to fault IDV(5), shown in Fig. 12.17. Among all the
monitored variables, only variable 31 shows a step change, and the rest gradually
return to normal under the action of the controller. In terms of the composition of the
principal components, the contribution of variable 31 is small. Therefore, in the normal
case (without adding outliers), its role lies mostly in the residual space.
After the outliers are added, its contribution to the principal components
increases, which means its role in the residual space is weakened. This in turn causes
the monitoring indicators in the residual space to return to normal. On the other
hand, its contribution to the principal components is still small in percentage terms,
so the monitoring indicators in the principal component space also do not significantly
reflect its characteristics.


Case 3: Slight Quality-Related Fault


Faults IDV(16) and IDV(17) are slight quality-related faults, which means that they
have almost no impact on output quality. Figure 12.18 shows the monitoring results of
the three models after adding outliers. The fault monitoring result of the PLS (outliers)
model is very poor, with many false positives. The $L_1$-PLS (outliers) model
and the $L_1$-SPLS (outliers) model effectively reduce these false alarms. It can also be
seen from the corresponding FDR that the monitoring results of the $L_1$-PLS (outliers)
model and the $L_1$-SPLS (outliers) model are more reasonable.

Fig. 12.18 PLS(outliers), $L_1$-PLS(outliers), and $L_1$-SPLS(outliers) monitoring results for IDV(17)

It can be seen from the above comparison results that, even if outliers are added
to the input data, the monitoring results of the $L_1$-(S)PLS model are still
greatly improved. In other words, the $L_1$-(S)PLS model improves both the robustness
and the fault detection performance.


12.6 Conclusions 


This chapter proposes a quality-related statistical monitoring method, the double robust
projection to latent structure ($L_1$-PLS), which enhances the robustness of the PLS
algorithm in two respects. On the one hand, the $L_1$-PLS method replaces the $L_2$
norm in the objective function with the $L_1$ norm and adds an $L_1$ norm penalty term
to the direction vector. On the other hand, the regression coefficients of the $L_1$-PLS
algorithm can also be obtained with the $L_1$ norm. Therefore, the $L_1$-PLS algorithm
has double robustness. A monitoring model based on the $L_1$-PLS method is then
established, and its robustness and monitoring performance are verified on the
TE process simulation platform. The results show that the $L_1$-PLS method has better
robustness and better performance in process monitoring and fault diagnosis.
Chapter 13
Bayesian Causal Network for Discrete Variables


Ensuring the safety of industrial systems requires not only detecting the faults, but
also locating them so that they can be eliminated. The previous chapters have
discussed the fault detection and identification methods. Fault traceability is also an
important issue in industrial systems. This chapter and Chap. 14 address fault
inference and root tracking based on the probabilistic graphical model. This model
explores the internal linkages of system variables quantitatively and qualitatively, so
it avoids the bottleneck of multivariate statistical models, which lack a clear mechanism.
The extracted features or principal components of a multivariate statistical model are
linear or nonlinear combinations of system variables and do not have any physical
meaning. So the multivariate statistical model is good at fault detection and identification,
but not at fault root tracking.

A Bayesian network (BN) can estimate and predict the potentially harmful factors
of a general system, but its structure learning has deficiencies when applied to
complex systems, such as a complicated training mechanism and variable
causalities. To simplify the network structure, many assumptions must be made in
advance, which inevitably costs generality. Usually, a generative model (linear or
nonlinear) is built to explain the data generating process, i.e., the causalities.
A variety of causal discovery methods have been proposed recently to find the
causalities (Hyvärinen et al. 2010; Hong et al. 2017). The most classical method is
the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al. 2010), in which the
full structure of the BN is identifiable without pre-specifying a causal order of
the variables. An improved LiNGAM method estimates the causal order of the
variables without any prior structural knowledge and provides better statistical
performance (Shimizu et al. 2011). The nonlinear causality of a pair of variables
is discovered in Johnson and Bhattacharyya (2015), but that method is limited when
dealing with more than two variables.

The above approaches exploit the complexity of the marginal and conditional
probability distributions in one way or another. Although a large number of methods
for bivariate causal discovery have been proposed over the last few years, their
practical performance has not been studied systematically. These methods have yet
to be applied to actual industrial systems, which usually do not meet the linear
and bivariate assumptions. To address these issues, this chapter proposes a more
general multivariate post-nonlinear acyclic causal model for complex industrial
processes. The proposed model, named the Bayesian Causal Network (BCN), can readily
find the causality among multiple variables. Compared with the traditional BN
structure, it is more compact and more consistent with the process mechanism. In
addition, it avoids the complex learning mechanism of the traditional BN, so it is
easier to implement without compromising accuracy.


13.1 Construction of Bayesian Causal Network 


There are many ways to describe the characteristics of a system from observational
data and expert knowledge, such as the graph model (Hipel et al. 2011), the neural
network model (Li et al. 2016), and the fuzzy model (Jiang et al. 2015). The graph
model is composed of points and lines that describe the system structure and the
causal relationships among variables. It provides an effective method for studying
various systems, especially complex ones. The Bayesian network, a typical graph
model, is the main method for handling knowledge representation and uncertainty
based on probability theory. It builds the causality and probability among the
process components and the system variables from prior knowledge and process
data. Building a BN consists of structure learning and parameter learning:
structure learning determines the causalities among the system variables, and
parameter learning reveals the quantitative relationship of these causalities.
Bayesian networks have been applied to fault diagnosis, financial analysis,
automatic target recognition, military applications, and many other areas (Zhu et
al. 2017).


13.1.1 Description of Bayesian Network 


A Bayesian network, also known as a belief network or directed acyclic graphical
model, is a probabilistic graphical model. It was first proposed by Judea Pearl in
1985 (Pearl 1986). It is an uncertainty processing model that simulates the causal
relationships in the human reasoning process, and its network topology is a
directed acyclic graph (DAG). The nodes in the DAG represent random variables,
including observable variables, hidden variables, unknown parameters, etc.
Variables or propositions that are believed to have a causal relationship (that is,
that are not conditionally independent) are connected by arrows. If two nodes are
connected by a single arrow, one of the nodes is the “cause” and the other is the
“effect”, and a conditional probability value describes the degree of causality
quantitatively.

Fig. 13.1 Bayesian network example


For example, assume that node A directly affects node B, written A → B. The
arrow from A to B establishes a directed arc (A, B) from node A to node B,
and its weight (the connection strength) is determined by the conditional
probability P(B|A). In short, a BN is formed by drawing the random variables in a
directed graph according to whether they are conditionally independent. Circles
usually represent the random variables (nodes) and arrows represent the
conditional dependencies. Figure 13.1 gives a simple Bayesian network (Ishak et al.
2011).
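
As a minimal numerical sketch of a single arc A → B, the joint distribution follows
from P(A) and P(B|A), and Bayes' rule reverses the arc to reason from the effect
back to the cause. The two-state nodes and all probability values below are
illustrative assumptions and are not taken from Fig. 13.1.

```python
import numpy as np

# Hypothetical two-state nodes: index 0 = "normal", index 1 = "faulty".
p_a = np.array([0.9, 0.1])                # P(A)
p_b_given_a = np.array([[0.95, 0.05],     # P(B | A = 0)
                        [0.20, 0.80]])    # P(B | A = 1)

# Joint distribution P(A, B) = P(A) * P(B | A); rows index A, columns index B.
joint = p_a[:, None] * p_b_given_a

# Marginal P(B), then the posterior P(A | B = 1) via Bayes' rule.
p_b = joint.sum(axis=0)
p_a_given_b1 = joint[:, 1] / p_b[1]

print("P(B)       =", p_b)
print("P(A | B=1) =", p_a_given_b1)
```

With these assumed numbers, observing the effect B in its faulty state raises the
probability of a faulty cause A well above its prior of 0.1.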


13.1.2 Establishing Multivariate Causal Structure 


Model-based causal discovery assumes a generative model to explain the data
generating process. When prior knowledge about the data model is unavailable, the
assumed model should be sufficiently general that it can approximate the real data
generating process. Furthermore, the model should be identifiable, so that it can
distinguish the causes from the effects. A nonlinear multivariable system typically
possesses the following three characteristics (Chen et al. 2018):


1. The multivariate causalities are usually nonlinear. 

2. The final target variable is affected by its cause variables and by noise that
is independent of the causes.

3. Sensors or measurements may introduce nonlinear distortions into the observed 
value of the variables. 


To discover the causality among multiple variables in complex industrial systems, a
more general multivariate post-nonlinear acyclic causal model with inner additive
noise is proposed. The model takes the form of a graph with a Bayesian network
structure. Assume that a DAG represents the relationships among the observed
variables. Mathematically, the generating process of $X_i$ is

$$X_i = f_{i,2}\left(f_{i,1}(PA_i) + e_i\right), \tag{13.1}$$

where the observed variables $X_i$, $i \in \{1, 2, \ldots, n\}$, are arranged in a
causal order such that no later variable causes any earlier variable. $PA_i$ is the
direct cause of $X_i$, $f_{i,1}$ denotes the nonlinear effect of this cause, and
$f_{i,2}$ denotes the invertible post-nonlinear distortion in variable $X_i$. $e_i$
is the independent disturbance, a continuous-valued random variable with a
non-Gaussian distribution and non-zero variance. Model (13.1) satisfies the three
characteristics above: function $f_{i,1}$ accounts for the nonlinear effect of the
causes $PA_i$; $e_i$ is the noise effect during the transmission from $PA_i$ to
$X_i$; and the invertible function $f_{i,2}$ reflects the nonlinear distortion
caused by the sensor or measurement.
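
To make model (13.1) concrete, the following sketch simulates one cause-effect pair
from a post-nonlinear model and checks that inverting the model recovers the
disturbance. The particular functions `f1` and `f2`, the noise scale, and the sample
size are assumptions chosen only for illustration; the chapter does not prescribe
them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def f1(x):
    # Hypothetical nonlinear effect of the cause (plays the role of f_{i,1}).
    return x ** 3

def f2(z):
    # Hypothetical invertible, strictly increasing distortion (role of f_{i,2}).
    return np.tanh(z / 4.0)

def f2_inv(y):
    # Exact inverse of f2 on its range (-1, 1).
    return 4.0 * np.arctanh(y)

x_cause = rng.uniform(-1.0, 1.0, n)   # the cause PA_i
e = rng.laplace(0.0, 0.2, n)          # non-Gaussian noise, drawn independently
x_effect = f2(f1(x_cause) + e)        # X_i = f_{i,2}(f_{i,1}(PA_i) + e_i)

# With the true functions, inverting the model returns the disturbance.
e_recovered = f2_inv(x_effect) - f1(x_cause)
print("max recovery error:", np.max(np.abs(e_recovered - e)))
```

In practice the functions are unknown, which is why the chapter estimates them with
constrained nonlinear ICA rather than inverting known functions as done here.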

Randomly select a pair of variables $X_i$ and $X_j$, $i, j \in \{1, 2, \ldots, n\}$.
Assume that the pair $(X_i, X_j)$ has the causal relation $X_i \to X_j$. Its data
generating process can be described by the generative model

$$X_j = f_{j,2}\left(f_{j,1}(X_i) + e_j\right), \tag{13.2}$$

where $e_j$ is independent of $X_i$. Define $s_i \triangleq f_{j,1}(X_i)$ and
$s_j \triangleq e_j$, so that $s_i$ is independent of $s_j$. Rewrite the generating
process $X_i \to X_j$ as follows:

$$\begin{aligned}
X_i &= f_{j,1}^{-1}(s_i), \\
X_j &= f_{j,2}(s_i + s_j).
\end{aligned} \tag{13.3}$$

$X_i$ and $X_j$ in (13.3) are post-nonlinear (PNL) mixtures of the independent
sources $s_i$ and $s_j$. The PNL mixing model can be seen as a special case of the
general nonlinear independent component analysis (ICA) model. A nonlinear ICA
method is therefore used to solve problem (13.3).

In general, there are two possible causal relations between any two random
variables $X_i$ and $X_j$: $X_i \to X_j$ or $X_j \to X_i$. The correct relation is
identified by judging which one satisfies the assumed model (13.2). If the causal
relation is $X_i \to X_j$ (i.e., $X_i$ and $X_j$ satisfy model (13.2)), the data
generating process (13.2) can be inverted to recover the disturbance $e_j$, which is
expected to be independent of $X_i$. Two steps are used to examine the possible
causal relationships between the variables.

In the first step, recover the disturbance $e_j$ corresponding to the assumed causal
relation $X_i \to X_j$ based on constrained nonlinear ICA. If this causal relation
holds, there exist nonlinear functions $f_{j,2}^{-1}$ and $f_{j,1}$ such that

$$e_j = f_{j,2}^{-1}(X_j) - f_{j,1}(X_i), \tag{13.4}$$

where $e_j$ is independent of $X_i$. Nonlinear ICA is thus performed using the
structure in Fig. 13.2, and the outputs of the system are

$$\begin{aligned}
Y_i &= X_i, \\
Y_j &= e_j = g_j(X_j) - g_i(X_i).
\end{aligned} \tag{13.5}$$

Fig. 13.2 Constrained nonlinear ICA system used to verify whether the causal
relation $X_i \to X_j$ holds


The nonlinearities $g_i$ and $g_j$ are modeled by multi-layer perceptrons (MLPs),
and their parameters are learned by making $Y_i$ and $Y_j$ as independent as
possible, i.e., by minimizing the mutual information between $Y_i$ and $Y_j$,

$$I(Y_i, Y_j) = H(Y_i) + H(Y_j) - H(Y), \tag{13.6}$$

where $H(Y)$ is the joint entropy of $Y = (Y_i, Y_j)^T$,

$$\begin{aligned}
H(Y) &= -E\left[\log p_Y(Y)\right] \\
     &= -E\left[\log p_X(X) - \log |J|\right] \\
     &= H(X) + E\left[\log |J|\right].
\end{aligned} \tag{13.7}$$

The joint density of $Y = (Y_i, Y_j)^T$ is $p_Y(Y) = p_X(X)/|J|$, where $J$ is the
Jacobian matrix of the transformation from $(X_i, X_j)$ to $(Y_i, Y_j)$, i.e.,

$$J = \frac{\partial (Y_i, Y_j)}{\partial (X_i, X_j)}, \qquad
|J| = \left|\det\begin{pmatrix} 1 & 0 \\ -g_i' & g_j' \end{pmatrix}\right| = |g_j'|. \tag{13.8}$$

Substituting (13.7) and (13.8) into (13.6) gives

$$I(Y_i, Y_j) = H(Y_i) + H(Y_j) - E\left[\log |J|\right] - H(X) \tag{13.9}$$

$$= -E\left[\log p_{Y_i}(Y_i)\right] - E\left[\log p_{Y_j}(Y_j)\right] - E\left[\log |g_j'|\right] - H(X), \tag{13.10}$$

where $H(X)$ does not depend on the parameters in $g_i$ and $g_j$ and can be treated
as a constant. The minimization problem (13.10) is solved by gradient-descent
methods; the details of the optimization are omitted.

In the second step, verify whether the estimated disturbance $Y_j$ is independent
of the assumed cause $Y_i$ using a statistical independence test. A kernel-based
statistical test is adopted with significance level $\alpha = 0.01$ (Giga 2014).
Denote the test statistic under the assumption $X_i \to X_j$ as
$\mathrm{test}_{i \to j}$. Repeat the above procedure with $X_i$ and $X_j$ exchanged
to verify whether $X_j \to X_i$ holds, which gives $\mathrm{test}_{j \to i}$. If
$\mathrm{test}_{i \to j} > \mathrm{test}_{j \to i}$, $Y_i$ and $Y_j$ are regarded as
not independent, that is, $X_i \to X_j$ does not hold. If
$\mathrm{test}_{i \to j} < \mathrm{test}_{j \to i}$, it is concluded that $X_i$
causes $X_j$, and $g_i$ and $g_j$ provide estimates of $f_{j,1}$ and $f_{j,2}^{-1}$,
respectively.
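
The chapter does not spell out the kernel statistic, so the sketch below uses the
biased Hilbert-Schmidt independence criterion (HSIC) estimate with Gaussian kernels
as an assumed stand-in. It only illustrates that such a statistic is small when a
residual is independent of the assumed cause and larger when it is not; it computes
no p-value, and the bandwidth, sample size, and test data are illustrative choices.

```python
import numpy as np

def rbf_gram(v, sigma=1.0):
    """Gaussian-kernel Gram matrix of a one-dimensional sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(u, v, sigma=1.0):
    """Biased HSIC estimate; values near zero suggest independence."""
    u = (u - u.mean()) / u.std()          # standardize so one bandwidth fits both
    v = (v - v.mean()) / v.std()
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K, L = rbf_gram(u, sigma), rbf_gram(v, sigma)
    return float(np.trace(K @ H @ L @ H)) / n ** 2

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 300)
e = rng.laplace(0.0, 0.2, 300)            # generated independently of x

stat_independent = hsic(x, e)             # residual independent of the cause
stat_dependent = hsic(x, x ** 3 + e)      # "residual" that still depends on x
print(stat_independent, stat_dependent)
```

Under the decision rule of the text, the direction whose recovered disturbance
yields the smaller statistic is the one retained.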

For a complex system with $n$ process variables, following the test sequence
$X_1 \to X_2, X_1 \to X_3, \ldots, X_{n-1} \to X_n$, $N$ groups of statistics should
be tested,

$$N = (n-1) + (n-2) + \cdots + 1 = \frac{n(n-1)}{2}. \tag{13.11}$$

The total computation is proportional to $2N$, since each pair is tested in both
directions, so the amount of computation grows quadratically with the number of
variables. The measured statistics in the positive order and in the reverse order
are stored as

$$\begin{aligned}
A &= \left[\mathrm{test}_{X_1 \to X_2},\ \mathrm{test}_{X_1 \to X_3},\ \ldots,\ \mathrm{test}_{X_{n-1} \to X_n}\right], \\
B &= \left[\mathrm{test}_{X_2 \to X_1},\ \mathrm{test}_{X_3 \to X_1},\ \ldots,\ \mathrm{test}_{X_n \to X_{n-1}}\right].
\end{aligned} \tag{13.12}$$

Comparing the corresponding elements of the vectors $A$ and $B$, the causal
direction of each pair of variables is determined by the smaller statistic. Once
the causality of all variables has been found by this cyclic search, the results
are integrated into a DAG.
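
The pairwise search can be sketched as follows. The function name `orient_pairs` and
the statistics in the example are hypothetical; the sketch only orients each pair
by the smaller statistic and does not decide which pairs are connected at all,
which the text leaves to the independence test.

```python
from itertools import combinations

def orient_pairs(n_vars, stat_forward, stat_reverse):
    """For every pair (i, j) with i < j, keep the direction with the smaller statistic.

    stat_forward[k] is the statistic for X_i -> X_j and stat_reverse[k] the one for
    X_j -> X_i, both listed in the order X1-X2, X1-X3, ..., X(n-1)-Xn.
    """
    pairs = list(combinations(range(1, n_vars + 1), 2))
    assert len(pairs) == n_vars * (n_vars - 1) // 2   # N = n(n-1)/2 pairs
    edges = []
    for (i, j), a, b in zip(pairs, stat_forward, stat_reverse):
        edges.append((i, j) if a < b else (j, i))
    return edges

# Hypothetical statistics for three variables (pairs 1-2, 1-3, 2-3).
A = [1.7e-3, 1.2e-4, 5.0e-3]
B = [6.5e-3, 6.0e-4, 2.0e-3]
edges = orient_pairs(3, A, B)
print(edges)   # directed edges as (cause, effect)
```

For the eight TE variables used later in this chapter, the same enumeration gives
28 pairs, i.e., 56 independence tests.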


13.1.3 Network Parameter Learning 


The multivariate causality model gives a framework similar to the Bayesian network
for finding the internal structure of complex systems. Its graphical structure
expresses the causal interactions and the direct/indirect relations as a
probabilistic network. Its parameters represent the intensity of the complex
inter-relationships among the cause-effect variables.

Consider a finite set $U = \{X_1, \ldots, X_n\}$ of discrete random variables, where
each variable $X_i$ may take on several discrete statuses from a finite set. A
Bayesian network is an annotated directed acyclic graph that encodes a joint
probability distribution over the set of random variables $U$. Formally, the
Bayesian network for $U$ is a pair $B = \langle G, \Theta \rangle$. $G$ is a directed
acyclic graph whose vertices correspond to the random variables $X_1, \ldots, X_n$.
$\Theta$ is the parameter set that quantifies the network, with
$\theta_{ijk} = p(x_i^k \mid pa_i^j)$ and $\sum_k \theta_{ijk} = 1$, where $x_i^k$ is
the $k$-th discrete status of $X_i$ and $pa_i^j$ is the $j$-th configuration of the
complete parent set $PA_i$ of $X_i$ in $G$. Every variable $X_i$ is conditionally
independent of its non-descendants given its parents (Markov condition). The joint
probability distribution over the set $U$ is

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid PA_i) = \prod_{i=1}^{n} \theta_{X_i \mid PA_i}. \tag{13.13}$$


The parameters of the causal Bayesian network are mainly learned from statistical
analysis of the sample data. Maximum likelihood estimation (MLE) is one of the most
classical and effective algorithms for parameter learning.

Given a data set $D = \{D_1, \ldots, D_n\}$ of all Bayesian network nodes, the goal
of parameter learning is to find the most probable values of $\Theta$. These values
best explain the data set $D$, which can be quantified by the log-likelihood
function $\log p(D \mid \Theta)$, denoted $L_D(\Theta)$. Assume that all samples are
drawn independently from the underlying distribution. According to the conditional
independence assumptions, we have

$$L_D(\Theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\,n_{ijk}}, \tag{13.14}$$

where $q_i$ is the number of combinations of the parent nodes $pa_i^j$, $r_i$ is the
number of statuses of node $X_i$, and $n_{ijk}$ is the number of samples in $D$ that
contain both $x_i^k$ and $pa_i^j$. If the data set $D$ is complete, the MLE method
can be described as a constrained optimization problem,

$$\begin{aligned}
&\max\ L_D(\Theta), \\
&\text{s.t. } g_{ij}(\Theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0, \quad \forall i = 1, \ldots, n, \ \forall j = 1, \ldots, q_i.
\end{aligned} \tag{13.15}$$

Its global optimum solution is

$$\theta_{ijk} = \frac{n_{ijk}}{\sum_{k=1}^{r_i} n_{ijk}}. \tag{13.16}$$
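
The closed-form solution (13.16) amounts to counting and normalizing. The sketch
below estimates the conditional probability table of one node with a single
discrete parent; the helper name `mle_cpt` and the toy samples are assumptions made
for illustration.

```python
import numpy as np

def mle_cpt(child, parent, r_child, r_parent):
    """Return theta[j, k] = n_jk / sum_k n_jk for a node with one discrete parent.

    Statuses are coded 0..r-1. Every parent status must occur at least once,
    otherwise the corresponding row would be divided by zero.
    """
    counts = np.zeros((r_parent, r_child))
    for p, c in zip(parent, child):
        counts[p, c] += 1                  # n_jk: parent status j, child status k
    return counts / counts.sum(axis=1, keepdims=True)

# Toy complete data set: eight joint observations of (parent, child).
parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
child = np.array([0, 0, 1, 1, 1, 0, 1, 0])
theta = mle_cpt(child, parent, r_child=2, r_parent=2)
print(theta)   # each row sums to 1, as required by the constraint in (13.15)
```

For a node with several parents, the same counting applies once each combination
of parent statuses is mapped to a single row index.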


13.2 BCN-Based Fault Detection and Inference 


The complete monitoring model is established by combining the multivariate causal
structure with Bayesian parameter learning, so that the qualitative and
quantitative relationships among the process variables are revealed as fully as
possible. The model is then used in the forward direction to predict the operating
status and detect faults of the critical process variables (i.e., forward
inference). It can also be used in the reverse direction to find the source of the
faults (i.e., backward inference). The overall block diagram of the proposed method
is shown in Fig. 13.3.

Causality network prediction or inference calculates the probability that the
hypothesis variables are in a certain status, according to the network topology and
the conditional probability distribution of the evidence variable. An inference or
query $P(Q = q \mid E = e_0)$ calculates the posterior probability that a query
variable $Q$ takes the specific value $q$, given the evidence $e_0$ for node $E$.

There are many existing network inference algorithms, such as the variable
elimination algorithm and the junction tree (JT) algorithm. These algorithms exploit
the hypothesis variables and the specific independence relations induced by the
evidence in the BN to simplify the updating task. JT implements the inference
procedure in four steps (Borsotto et al. 2006):


1. Cluster the nodes into several cliques; 

2. Connect the cliques to form a junction tree; 
3. Propagate information in the network; 

4. Answer a query. 


The inference starts from a root clique. The core step, message propagation,
consists of a collection phase and a distribution phase. The cliques of the
junction tree are connected by separators such that the so-called junction tree
property holds. When a message is passed from one clique X to another clique Y, it
is mediated by the separator set S between the two cliques. Every conditional
probability distribution of the original BN is associated with a clique such that
the domain of the distribution is a subset of the clique domain (the notation
$\mathrm{dom}(\phi)$ refers to the domain of a potential $\phi$). In standard
junction tree architectures, the set of distributions $\Phi_X$ associated with a
clique X is combined to form the initial clique potential $\phi_X$,

$$\phi_X = \prod_{\phi \in \Phi_X} \phi. \tag{13.17}$$


For a clique, a potential or a message is a mapping from the value assignments of
the nodes to the interval [0, 1]. In the Hugin architecture (proposed by Jensen et
al. 1990), a message pass from X to Y takes place in two procedures: projection and
absorption. The projection procedure saves the current separator potential and
assigns a new one to S:

$$\phi_S^{old} \leftarrow \phi_S \quad \text{and} \quad \phi_S \leftarrow \sum_{X \setminus S} \phi_X. \tag{13.18}$$

The absorption procedure assigns a new potential to Y using both the old and the
new tables of S,

$$\phi_Y \leftarrow \phi_Y \frac{\phi_S}{\phi_S^{old}}, \tag{13.19}$$

where $\phi_S$ is the current separator potential, $\phi_S^{old}$ is the old
separator potential, $\phi_X$ is the clique potential for X, and $\phi_Y$ is the
clique potential for Y.
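
A single Hugin message pass, following (13.18) and (13.19), can be sketched on a
three-node chain A → B → C with cliques X = {A, B} and Y = {B, C} and separator
S = {B}. The chain, the binary statuses, and all probability values are
illustrative assumptions.

```python
import numpy as np

# Hypothetical conditional probability tables of the chain A -> B -> C.
p_a = np.array([0.9, 0.1])
p_b_given_a = np.array([[0.95, 0.05], [0.20, 0.80]])   # rows: A, columns: B
p_c_given_b = np.array([[0.90, 0.10], [0.30, 0.70]])   # rows: B, columns: C

# Initial potentials: each table is assigned to one clique, the separator is 1.
phi_x = p_a[:, None] * p_b_given_a    # clique X = {A, B}, axes (A, B)
phi_y = p_c_given_b.copy()            # clique Y = {B, C}, axes (B, C)
phi_s = np.ones(2)                    # separator S = {B}

# Projection: save the old separator table, then marginalize X onto S.
phi_s_old = phi_s
phi_s = phi_x.sum(axis=0)             # sum over X \ S = {A}

# Absorption: rescale Y by the ratio of new to old separator tables.
phi_y = phi_y * (phi_s / phi_s_old)[:, None]

p_c = phi_y.sum(axis=0)               # after the pass, phi_y equals P(B, C)
print("P(C) =", p_c)
```

After this one pass from X to Y, the clique Y holds the joint distribution of its
own variables, so any marginal over B or C can be read from it directly.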

The query answering step has two procedures. First, the marginalization procedure
calculates the joint probability of $Q$ and $E = e_0$:
$P(Q, E = e_0) = \sum_{X \setminus \{Q\}} \phi_X$. Second, the normalization
procedure calculates the inference result,

$$P(Q = q \mid E = e_0) = \frac{P(Q = q, E = e_0)}{\sum_{Q} P(Q, E = e_0)}. \tag{13.20}$$
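
The marginalization and normalization in (13.20) can be checked by brute-force
enumeration on a small network. The sketch below performs backward inference on an
assumed chain A → B → C, asking for the cause A given evidence on C; the network and
its probabilities are illustrative, and enumeration is used here only because the
example is tiny, not as a replacement for the junction tree algorithm.

```python
import numpy as np

# Hypothetical conditional probability tables of the chain A -> B -> C.
p_a = np.array([0.9, 0.1])
p_b_given_a = np.array([[0.95, 0.05], [0.20, 0.80]])
p_c_given_b = np.array([[0.90, 0.10], [0.30, 0.70]])

# Full joint P(A, B, C) with axes (A, B, C).
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]

# Marginalization: P(A, C = 1), summing out the non-query variable B.
evidence_c = 1
p_a_and_e = joint[:, :, evidence_c].sum(axis=1)

# Normalization: divide by the probability of the evidence.
posterior_a = p_a_and_e / p_a_and_e.sum()
print("P(A | C=1) =", posterior_a)
```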


A fault in the operational variables is an intervention that has various effects on
the production process. The main task in fault detection is to predict the system
output and detect whether a fault occurs. The objective of causal inference is to
find the real root cause under the faulty intervention.


13.3 Case Study 


In order to evaluate the performance of the proposed method, the experiment results 
are reported from three aspects: the causal direction identification of multi-variables, 
network parameter learning, and probability inference. 


13.3.1 Public Data Sets Experiment 


Four published data sets proposed by Mooij and Janzing (Leoand et al. 2001) are
used to test the effectiveness of the nonlinear multivariate causal model. The
cause-effect pairs are available at http://webdav.tuebingen.mpg.de/cause-effect/,
which is regarded as a benchmark for testing causal detection algorithms. The four
data sets are (1) the ground altitude and temperature sampled at 349 stations, US;
(2) the census income data set, which contains weighted census data extracted from
the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.

Fig. 13.4 Scatter plots of the four data sets; panels a-d correspond to data sets
(1)-(4), respectively


Table 13.1 Independence test statistics under different assumptions of causal direction

Causal assumption | x → y | y → x
#1 | 1.7 x 107° | 6.5 x 107°
#2 | 1.2 x 10^-4 | 6.7 x 10^-4
#3 | 3.5 x 107° | 8.1 x 10-7
#4 | 2.2 x 107° | 5.7 x 107°

Table 13.2 Causal results of the public data sets

Data sets | #1 | #2 | #3 | #4
Data information | x: altitude; y: temperature | x: population; y: infant mortality rate | x: age; y: wage per hour | x: age; y: heart rate
Real direction | x → y | x → y | x → y | x → y
Test results | x → y | x → y | x → y | x → y
True or false | True | True | True | True


The variables include age and wage per hour; (3) the attribute information (age and
heart rate) from the Cardiac Arrhythmia database; (4) the percentage of the total
population with sustainable access to improved drinking water sources and the
infant mortality rate (per 1000 live births, both sexes) in 2006. Each data set
consists of two random variables whose cause-effect relationship is known. The four
data sets have different attributes, which is sufficient to show the general and
comprehensive nature of the data.

Figure 13.4 gives the scatter plots of the selected data sets (1)-(4). Table 13.1
shows the independence test statistics on x and y for data sets (1)-(4) under the
two assumed causal directions; the statistics are calculated separately for each
assumption. Comparing the two statistics for each data set in Table 13.1, the
causal direction of every data set is determined as x → y. Table 13.2 summarizes
the causal results obtained by the multivariate causality model. The test results
are consistent with the real causal relationships, so the proposed method correctly
identifies the causal direction on these four diverse data sets.

Fig. 13.5 The network


13.3.2 TE Process Experiment


To illustrate the applicability of the proposed method to an actual complex
industrial process, the network topology of the TE process is established and used
to predict the alarm variables. The TE platform simulates an actual chemical
process; a detailed description of the TE process is given in Chap. 4.

Experiment 1: Build Causal Structure

In this experiment, eight important process variables are selected and their
causality is calculated; the number is kept small to facilitate visualization of
the results. From the mechanism analysis of the TE process, when the reactor feed
X2 increases, the material first enters the reactor, so the reactor level X4 must
increase. The reactor feed X2 therefore directly affects the reactor level X4. The
cooling water temperature X8 and the reactor feed X2 are the main factors affecting
the reactor temperature X5. The reactor pressure X3 changes synchronously with the
reactor temperature X5 according to general physical principles. In addition, once
the chemical reaction in the reactor becomes more intense, the compressor power X7
increases synchronously because of the sequential loop. At the same time, the
reactor pressure X3 also has an obvious influence on the recycle flow X1 and the
material level X6 in the separator. The initial structure of the causality network
is built from this mechanism analysis (including expert prior knowledge and the
intuitive correlation analysis of the process variables); it is named Bnet0 and
shown in Fig. 13.5.

The pre-defined fault is a random variation in the A, B, C compositions in stream 4.
The corresponding data of the eight variables are collected from the simulation
platform. The reaction length is 700 h to ensure that the data are sufficient to
reflect the system process, and 500 samples are obtained after equal-interval
decimation. The causal direction of the paired variables is shown in Table 13.3.

Table 13.3 Causal direction of TE variables

Variables information | Statistic (positive/reverse) | Causal direction
X2: Reactor feed rate; X5: Reactor temperature | 5.7 x 10^-6 / 8.2 x 107° | X2 → X5
X5: Reactor temperature; X8: Reactor cooling water outlet temperature | TAX 1076/2.9 x 107° | X8 → X5
X2: Reactor feed rate; X4: Reactor level | 3.4 x 10^-4 / 8.5 x 10^-4 | X2 → X4
X5: Reactor temperature; X7: Compress work | 7.3 x 10^-4 / 9.2 x 10^-4 | X5 → X7
X3: Reactor pressure; X5: Reactor temperature | 7.6 x 10^-5 / 4.5 x 10^-5 | X5 → X3
X3: Reactor pressure; X6: Product separator level | 2.9 x 10° /3.9 x 107° | X3 → X6
X1: Recycle flow; X3: Reactor pressure | 6.6 x 10^-6 / 2.7 x 10^-6 | X3 → X1

Three different causality models are compared: (1) Bnet1, the proposed multivariate
post-nonlinear acyclic causal model, shown in Fig. 13.6a; (2) Bnet2, an alternative
network obtained from a traditional BN structure learning method, the K2 algorithm,
which requires the node order to be set, shown in Fig. 13.6b; (3) Bnet3, the network
structure learned with the expectation maximization (EM) algorithm, shown in
Fig. 13.6c.

Fig. 13.6 Network comparison: a Bnet1, b Bnet2, c Bnet3

Comparing the mechanism-based structure Bnet0 with Bnet1 determined by the proposed
Bayesian Causal Network shows that Bnet1 is exactly consistent with Bnet0. The
structure determined by the proposed method matches the mechanism and expert
knowledge, which indicates that the causal structure is credible and accurate. In
contrast, Bnet2 and Bnet3, learned by traditional BN methods, are not consistent
with the mechanism and deviate considerably from the actual physical relationships.
This suggests that the general BN learning methods fail on this complex nonlinear
system, while the proposed multivariate causality model performs better.

Table 13.4 Threshold setting for alarm status in different variables

Alarm status | X1 (km³/h) | X2 (km³/h) | X3 (kPa) | X4 (%) | X5 (°C) | X6 (%) | X7 (kW) | X8 (°C)
1 | < 31 | < 46 | < 2789 | < 62.5 | < 122.7 | < 45 | < 268 | < 102.25
2 | 31-32 | 46-47 | 2789-2796 | 62.5-63.8 | 122.7-122.87 | 45-47.2 | 268-272.3 | 102.25-102.41
3 | 32-33 | 47-48.3 | 2796-2802 | 63.8-66 | 122.87-122.93 | 47.2-52.2 | 272.3-274 | 102.41-102.55
4 | 33-34 | 48.3-49.5 | 2804-2809 | 66-66.8 | 122.93-123.2 | 52.2-53 | 274-280 | 102.55-102.7
5 | > 34 | > 49.5 | > 2809 | > 66.8 | > 123.2 | > 53 | > 280 | > 102.7


Experiment 2: Parameter Learning

Once the TE network structure is determined, the alarm prediction model is obtained
by parameter learning on this causal network structure. In general, a process alarm
event can be divided into five alarm levels, namely, low-low alarm (LL), low alarm
(L), normal (N), high alarm (H), and high-high alarm (HH), corresponding to the
numbers 1, 2, 3, 4, 5. The first step is to discretize the continuous variables into
the five alarm levels by setting different thresholds, as shown in Table 13.4.
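
The discretization step can be sketched with `numpy.digitize`. The interior
thresholds below are the X5 (reactor temperature) column of Table 13.4; the function
name `alarm_level` and the sample readings are assumptions for illustration.

```python
import numpy as np

# Interior thresholds for the reactor temperature X5 (deg C), from Table 13.4.
x5_edges = np.array([122.7, 122.87, 122.93, 123.2])

def alarm_level(values, edges):
    """Map continuous readings to alarm status 1..5 (1 = lowest band, 5 = highest)."""
    # np.digitize returns 0 below the first edge and len(edges) above the last one.
    return np.digitize(values, edges) + 1

readings = np.array([122.5, 122.8, 122.9, 123.0, 123.5])   # hypothetical samples
levels = alarm_level(readings, x5_edges)
print(levels)
```

The other seven variables would be handled in the same way, each with its own
column of thresholds.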

Here the MLE algorithm is adopted to learn the network parameters and obtain a
complete probability table. Suppose that the initial probability of the alarm
levels in the normal condition is theoretically divided equally. Then the
conditional probability values for all variables are calculated by BN parameter
learning. For the two root nodes X2 and X8, the probabilities of the five statuses
are 0.0843, 0.2211, 0.4704, 0.2026 and 0.0217, respectively. The probabilities of
the descendant variables are shown in Fig. 13.7. A heat map is used to show the
probabilities, since the precise values matter little for alarm prediction and
inference. The color represents the probability, ranging between 0 and 1.

Attention should be paid to the probability values close to 1, which are the key
points in determining the inference results. When a probability is less than 0.5,
the corresponding outcome is unlikely to appear in actual inference. Figure 13.7a
shows the probability of X5 under the combined action of X2 and X8. The abscissa is
the status condition of X8 and X2, and the ordinate is the five-alarm status of X5,
whose probability value is displayed in the corresponding color.
P(X5 = 1 | X8 = 1, 2 and X2 = 1) ≈ 1 in the lower left corner of Fig. 13.7a. This
means that X5 is in the low-low alarm status with probability close to 1 when X2
and X8 are in the low-low alarm status. P(X5 = 5 | X8 = 4, 5 and X2 = 5) ≈ 1 in the
upper right corner of Fig. 13.7a. This means that X5 is in the high-high alarm
status with probability close to 1 when X2 and X8 are in the high-high alarm
status. These inference results are consistent with the actual mechanism.

Figure 13.7b-e reflects the probability relationships between pairs of variables.
Figure 13.7b shows the probability of X4 under the action of X3.
P(X4 = 5 | X3 = 5) ≈ 1 in the upper right corner, which means that X4 is in the
high-high alarm status with probability close to 1 when X3 is in the high-high
alarm status. However, P(X4 = 1 | X3 = 5) ≈ 0 in the lower right corner, which
means that X4 is in the low-low alarm status with probability close to 0 when X3 is
in the high-high alarm status. P(X4 = 1 | X3 = 2) ≈ P(X4 = 2 | X3 = 2) ≈ 0.5 in the
green area, which means that X4 is almost equally likely to be in the low-low or
the low alarm status when X3 is in the low alarm status. Similarly, the inference
results obtained from Fig. 13.7c-e are consistent with the mechanism.

Fig. 13.7 Conditional probability of the descendant variables: a P(X5|X8, X2),
b P(X4|X3), c P(X3|X5), d P(X7|X5), e P(X1|X3), f P(X6|X3)

Table 13.5 Alarm level prediction of compress work X7

No. | X2 | X8 | X5 | X7 | X̂7 | Max prob.
1 | 1 | 2 | 1 | 2 | 1 | 0.4571
2 | 2 | 1 | 2 | 1 | 1 | 0.6501
3 | 1 | 2 | 2 | 2 | 2 | 0.7627
4 | 2 | 1 | 2 | 2 | 2 | 0.6729
5 | 1 | 2 | 2 | 1 | 1 | 0.6896
6 | 3 | 3 | 2 | 3 | 1 | 0.8760
7 | 3 | 3 | 2 | 3 | 3 | 0.6344
8 | 3 | 3 | 3 | 2 | 2 | 0.8563
9 | 3 | 3 | 2 | 3 | 2 | 0.3454
10 | 2 | 3 | 3 | 3 | 3 | 0.5073
11 | 3 | 3 | 3 | 2 | 3 | 0.4432
12 | 3 | 2 | 3 | 3 | 3 | 0.5696
13 | 4 | 3 | 4 | 4 | 3 | 0.3128
14 | 3 | 4 | 4 | 4 | 4 | 0.6284
15 | 4 | 5 | 5 | 5 | 5 | 0.7557
16 | 4 | 3 | 4 | 4 | 5 | 0.3783
17 | 5 | 5 | 4 | 4 | 4 | 0.7947
18 | 4 | 5 | 4 | 4 | 4 | 0.8325
19 | 5 | 4 | 5 | 4 | 5 | 0.6454
20 | 5 | 4 | 4 | 5 | 5 | 0.8113

Experiment 3: Alarm Prediction

Alarm prediction is a top-down inference from the evidence to the conclusion. The
probabilistic analysis calculates the likelihood that the result variable is in
each status. The discrete status with the maximum probability is the alarm
prediction result.

Using the established multivariate causality network model, the compress work X7 is
predicted when its parent variables X2, X8 and X5 are known. The prediction results
for model Bnet1 are shown in Table 13.5, where X̂7 is the predicted value of X7.

The total prediction accuracy for the 20 simulation experiments is 75%. When the
maximum probability of the predicted value is greater than 0.5, the prediction
result is confident, and predictions with a high probability are consistent with
the true status. When the maximum probability of the predicted value is less than
0.5, the prediction result is neither reliable nor accurate. The mis-predictions
confuse adjacent statuses, such as the normal status 3 and the low alarm 2 (or the
high alarm 4). The simulation results show that the multivariate causality network
can find the intrinsic relationships among the process variables and give
reasonably accurate fault or alarm predictions.
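
The decision rule described above (take the status with the maximum posterior
probability and treat a maximum below 0.5 as unreliable) can be sketched as follows.
The function name `predict_alarm` and the posterior vectors are hypothetical values
chosen for illustration; they are not rows of Table 13.5.

```python
import numpy as np

def predict_alarm(posterior, min_confidence=0.5):
    """Return (predicted status 1..5, its probability, whether it is confident)."""
    posterior = np.asarray(posterior, dtype=float)
    k = int(np.argmax(posterior))                 # index of the most probable status
    p_max = float(posterior[k])
    return k + 1, p_max, p_max > min_confidence

# Hypothetical posteriors over the five alarm statuses of one variable.
sharp = [0.05, 0.10, 0.15, 0.63, 0.07]            # one status clearly dominates
flat = [0.31, 0.30, 0.25, 0.10, 0.04]             # adjacent statuses compete

print(predict_alarm(sharp))
print(predict_alarm(flat))
```

A flat posterior such as the second one is the situation in which neighboring alarm
statuses are easily confused.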


13.4 Conclusions 


This chapter proposes a multivariate causality model to analyze the causal direction
among multiple variables and finally determine the network topology. The proposed
method describes the system structure more accurately than the traditional BN
structure learning methods, especially when the industrial process is highly
complex. Combined with network parameter learning and the evidence inference
technique, accurate monitoring and alarm prediction can be performed. The validity
of the proposed method is verified on the public data sets and the TE process. A
compact network structure and confident alarm prediction are obtained for the TE
process based on the causal analysis and probability inference. Both the
methodology and the simulation results show that the proposed multivariate
causality model is valuable for modeling and monitoring in the process industry.

Some issues deserve further discussion. The computing efficiency of the proposed
multivariate post-nonlinear acyclic causal modeling method should be considered
when solving large-scale causal analysis problems in the real world. Developing an
efficient algorithm to find the causal relationships among multiple variables based
on general functional causal models is still an important topic. To improve the
computational efficiency, a feasible solution is to limit the complexity of the
causal structure, for example by decreasing the number of direct causes of each
variable. Moreover, a smart optimization procedure instead of the exhaustive search
should be considered further.
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Chapter 14 A) 
Probabilistic Graphical Model for get 
Continuous Variables 


Most of the data sampled in complex industrial processes are sequential in time.
The traditional BN learning mechanisms restrict the form of the probability values
and therefore cannot be applied directly to such time series. The model established in Chap. 13
is a graphical model similar to a Bayesian network, but its parameter learning method
can only handle discrete variables. This chapter aims at a probabilistic graphical
model built directly for continuous process variables, which avoids the assumption of
discrete or Gaussian distributions.

This chapter extends the previous work in Chap. 13 from random discrete variables
to random continuous variables. In addition to enhancing the effect of causal
structure and parameter learning on continuous variables, kernel density estimation
(KDE) is used to construct the node association strength of the causal graph network
in the form of probability density. The conditional probability density is obtained
from a mathematical operation between the low-dimensional probability density
and the high-dimensional joint probability density. This non-parametric estimation
method directly estimates the probability density of continuous variables and avoids
the limitations of traditional Gaussian assumptions. Moreover, this chapter rigorously
derives the evaluation index for the KDE estimation quality. The proposed causal
learning mechanism does not impose restrictions such as linearity, nonlinearity, or a
particular distribution function. It establishes an accurate causal probability graphical
model to detect faults and find the root cause of the fault.


14.1 Construction of Probabilistic Graphical Model 


14.1.1 Multivariate Causal Structure Learning


The first step in building a graphical model is to construct the causal topology.
The assumed causal model is a post-nonlinear model, and the
causal relationships among multiple variables are determined through hypothesis testing.
Detailed information can be found in Chap. 13 (Chen et al. 2018).

Consider a model that represents the causal relationships between variables.
Here a generative model is used to explain the data generation process. When the
actual mechanism that generates the data cannot be determined, the hypothetical model
should be sufficiently versatile so that it can approximate the actual
data generation process. In addition, the model should be identifiable so that cause and
effect can be distinguished.

In order to discover the causality among multiple variables in a complex system, a
more general multivariable nonlinear acyclic causal model with internal additive
noise is adopted, the same as in Chap. 13. The model follows graph theory and the
Bayesian network structure. Assume that a directed acyclic graph (DAG) represents
the relationships among the observed variables. Select a pair of variables $X_i$
and $X_j$, $i, j \in \{1, 2, \ldots, n\}$, from the system. If $X_i$ is the parent node of $X_j$
and the data generating process is described by a post-nonlinear (PNL) mixing model,
then $X_j$ is generated as $X_j = f_{j,2}(f_{j,1}(X_i) + e_j)$, where $f_{j,1}$ denotes the
nonlinear effect of the cause, $f_{j,2}$ denotes the invertible post-nonlinear distortion
in variable $X_j$, and $e_j$ is the independent disturbance. A combination of hypothesis
testing and nonlinear independent component analysis (ICA) is applicable to
this problem (Shimizu et al. 2011). In simplified terms, the procedure can
be divided into two steps:


1. The nonlinear ICA method with constraints is used to calculate the disturbance
$e_j$ corresponding to the assumed causality $X_i \to X_j$;

2. A statistical independence test is used to determine whether the estimated
disturbance $e_j$ is independent of the assumed cause $X_i$.


For any pair of variables in the system, two causal assumptions can be made,
one for each direction ($X_i \to X_j$ and $X_j \to X_i$). The direction of the causality is
determined by comparing the independence statistics obtained for the two assumptions. After
$n(n-1)$ hypotheses and tests, the causality of all system variables is finally
determined. Therefore, this multivariate nonlinear acyclic causal modeling method is
not subject to the limitations of Bayesian network structure learning, and it can effectively
establish the causal structure of the process.
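
The two-step test for a single pair of variables can be illustrated with a short Python sketch. It is a deliberately simplified stand-in and not the constrained nonlinear ICA procedure of Chap. 13: the post-nonlinear distortion is assumed to be the identity (a plain additive-noise model), the nonlinear effect is fitted by a cubic polynomial, and independence is scored with a biased HSIC statistic. The function names `hsic`, `residual_dependence` and `infer_direction`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC dependence score with unit-bandwidth Gaussian kernels
    on standardized inputs (larger value = stronger dependence)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    n = a.size
    K = np.exp(-0.5 * (a[:, None] - a[None, :]) ** 2)
    L = np.exp(-0.5 * (b[:, None] - b[None, :]) ** 2)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def residual_dependence(cause, effect, degree=3):
    """Step 1: estimate the disturbance under the assumed direction.
    Step 2: score its dependence on the assumed cause."""
    coeffs = np.polyfit(cause, effect, degree)
    disturbance = effect - np.polyval(coeffs, cause)
    return hsic(cause, disturbance)

def infer_direction(xi, xj):
    """Compare the two causal assumptions and keep the one whose
    estimated disturbance is less dependent on the assumed cause."""
    forward = residual_dependence(xi, xj)   # assumption X_i -> X_j
    backward = residual_dependence(xj, xi)  # assumption X_j -> X_i
    return ("i->j" if forward < backward else "j->i"), forward, backward

# Synthetic pair in which X_i really causes X_j (illustrative data only).
rng = np.random.default_rng(0)
xi = rng.uniform(-2.0, 2.0, 400)
xj = xi ** 3 + rng.uniform(-1.0, 1.0, 400)
direction, forward, backward = infer_direction(xi, xj)
print(direction, forward, backward)
```

Repeating this comparison over all ordered pairs gives the $n(n-1)$ hypotheses mentioned above; in practice the independence score would be replaced by a proper statistical test with a significance level.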


14.1.2 Probability Density Estimation 


Section 14.1.1 completed the construction of the causal structure of the model. The
complete graph model should also include the quantitative relationships between
nodes, which are described here as probabilistic connections. The probability
density of each node variable is determined by a non-parametric probability density
estimation method. Because a child node is affected by its parent node, the
probabilistic connection manifests itself in the conditional probability
density. Kernel density estimation (KDE) is a prominent non-parametric method for
estimating the probability density. The explicit form of the density function is the
main advantage of the KDE method (Chen et al. 2018).

Let $X_1, X_2, X_3, \ldots, X_n$ be a set of samples of the random variable $X$, whose density
function $f(x)$, $x \in R$, is unknown. The density function $f(x)$ can be
derived from its corresponding cumulative distribution function $F(x)$,

$$ f(x) = \frac{dF(x)}{dx} \approx \frac{F(x+h) - F(x-h)}{2h}, \tag{14.1} $$

where $h > 0$ is the window width. The empirical distribution function
$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$ is used to estimate $F(x)$. Substituting it into (14.1) gives

$$ \hat{f}(x) = \frac{dF_n(x)}{dx} \approx \frac{F_n(x+h) - F_n(x-h)}{2h}
   = \frac{1}{nh} \sum_{i=1}^{n} K_0\left(\frac{X_i - x}{h}\right). \tag{14.2} $$

(14.2) gives the KDE for $f(x)$ with a window width $h$ and a kernel function
$K_0(u) = \frac{1}{2} I(|u| \le 1)$.
The more general kernel density estimate is

$$ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right), \tag{14.3} $$

where $\hat{f}(x)$ is the estimate of the probability density function, and $n$, $h$, $K$ are the
number of samples, the window width and the kernel function, respectively.
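
As an illustration of (14.3), the following Python sketch evaluates the estimate on a grid. It assumes a Gaussian kernel and a hand-picked window width $h = 0.4$; the function names `gaussian_kernel` and `kde`, the synthetic samples and the grid are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, samples, h, kernel=gaussian_kernel):
    """Evaluate (14.3): f_hat(x) = 1/(n h) * sum_i K((X_i - x) / h)."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    samples = np.asarray(samples, dtype=float)
    u = (samples[None, :] - x_eval[:, None]) / h  # one row per evaluation point
    return kernel(u).sum(axis=1) / (samples.size * h)

# Synthetic samples standing in for one process variable (illustrative only).
rng = np.random.default_rng(1)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, samples, h=0.4)

# A valid density estimate is non-negative and integrates to about one.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Any other kernel can be passed through the `kernel` argument without changing the estimator itself.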
The calculation of the conditional probability density requires additional mathematical
operations. Similarly, consider two random sample sets $X_1, X_2, X_3, \ldots, X_n$ and
$Y_1, Y_2, Y_3, \ldots, Y_n$, where $X$ is the cause variable and $Y$ is the effect variable. The joint
probability density of $x$ and $y$ is estimated as

$$ \hat{f}(x, y) = \frac{1}{n h_1 h_2} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_1}, \frac{y - Y_i}{h_2}\right), \tag{14.4} $$

where $h_1$ and $h_2$ are the window widths corresponding to the cause variable $x$ and the
effect variable $y$, respectively.

Table 14.1 Common kernel functions

Number  Kernel function  Expression
1       Uniform          $\frac{1}{2} I(|u| \le 1)$
2       Triangle         $(1 - |u|)\, I(|u| \le 1)$
3       Gaussian         $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} u^2\right)$
4       Epanechnikov     $\frac{3}{4} (1 - u^2)\, I(|u| \le 1)$

According to the definition of conditional probability, the conditional density
$\hat{f}(y|x)$ is obtained as follows:

$$ \hat{f}(y|x) = \frac{\hat{f}(x, y)}{\hat{f}(x)}. \tag{14.5} $$
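
The ratio in (14.5) can be sketched in Python as follows. The sketch assumes that the two-dimensional kernel in (14.4) is a product of two one-dimensional Gaussian kernels, which is one common choice and not necessarily the one used in the original work; the function names, the window widths and the synthetic cause-effect data are hypothetical.

```python
import numpy as np

def gauss(u):
    """One-dimensional Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def conditional_density(x, y, xs, ys, h1, h2):
    """f_hat(y | x) = f_hat(x, y) / f_hat(x), cf. (14.4) and (14.5),
    with the assumed product kernel K(u, v) = K(u) K(v)."""
    kx = gauss((x - xs) / h1)  # kernel weights of the cause samples
    ky = gauss((y - ys) / h2)  # kernel weights of the effect samples
    n = xs.size
    joint = np.sum(kx * ky) / (n * h1 * h2)
    marginal = np.sum(kx) / (n * h1)
    return joint / marginal

# Synthetic cause X and effect Y (illustrative data only).
rng = np.random.default_rng(2)
xs = rng.normal(size=800)
ys = 2.0 * xs + 0.5 * rng.normal(size=800)

# Conditional density of y given x = 1, evaluated on a grid of y values.
y_grid = np.linspace(-8.0, 8.0, 1601)
dy = y_grid[1] - y_grid[0]
cond = np.array([conditional_density(1.0, y, xs, ys, 0.3, 0.3) for y in y_grid])
area = cond.sum() * dy
cond_mean = np.sum(y_grid * cond) * dy
print(round(area, 3), round(cond_mean, 2))
```

For every fixed $x$ the resulting curve is itself a density in $y$, so it integrates to about one, and its mass concentrates around the values of the effect that are typical for that value of the cause.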


The kernel function affects the precision of kernel density estimation, so
selecting an appropriate kernel function is an important issue. Usually, the following
properties should be considered: symmetry, non-negativity, and normalization (Zeng
et al. 2017). The mathematical description of common kernel functions is given in
Table 14.1 (Jiang and Nicholas 2014).

It can be seen from the KDE expression that the kernel function $K$, the sample size $n$
and the window width $h$ are the main factors that determine $\hat{f}(x)$. Once the number
of samples $n$ is fixed, $K$ and $h$ directly affect the accuracy of the system model
parameters, and the effectiveness of fault detection and root cause diagnosis
varies accordingly. Therefore, in order to estimate the probability density more
accurately and improve the estimation quality of KDE, a KDE evaluation criterion
is given in the next section. Existing results show that the choice of kernel
function has a negligible effect on the result of kernel density estimation (Silverman
1998), so the optimization of $K$ is not considered here.


14.1.3 Evaluation Index of Estimation Quality 


According to the definition of kernel density, consider the following two cases. (1) The
value of the window width $h$ is very large. The averaging and compression effect of the scaled
argument $(X_i - x)/h$ removes the local details of the probability density function, so
the estimated probability density curve is over-smoothed. The resolution
is relatively low in this case, and the estimation bias is enlarged. (2) The value of the
window width is very small. In contrast, the influence of the randomness of the
samples increases, and the important characteristics of the density are
masked. The density estimate fluctuates strongly and its stability
deteriorates easily. The estimation variance is too large in this case (Jiang and
Nicholas 2014).
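
The two cases can be reproduced numerically. The following Python sketch is an illustration under stated assumptions: the samples are drawn from a standard normal distribution so that the true density is known, a Gaussian kernel is used, and the three window widths are arbitrary. It compares the integrated squared error of the estimate for a very small, a moderate and a very large $h$; the helper name `kde_gauss` is hypothetical.

```python
import numpy as np

def kde_gauss(x_eval, samples, h):
    """Gaussian-kernel density estimate evaluated at the points x_eval."""
    u = (samples[None, :] - x_eval[:, None]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(3)
samples = rng.normal(size=500)  # the true density is N(0, 1)
grid = np.linspace(-5.0, 5.0, 1001)
dx = grid[1] - grid[0]
true_pdf = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)

# Integrated squared error for an undersmoothed, a moderate and an
# oversmoothed window width.
ise = {}
for h in (0.02, 0.4, 3.0):
    estimate = kde_gauss(grid, samples, h)
    ise[h] = np.sum((estimate - true_pdf) ** 2) * dx

for h, value in ise.items():
    print(h, round(value, 4))
```

The small window width suffers from a large variance and the large one from a large bias, while the moderate value gives the smallest error of the three.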

An accurate estimate should be close to the true values
and remain stable for different observations. These two attributes are described
by the estimation bias and variance, which are given as

$$ \begin{aligned}
\mathrm{Bias}\{\hat{f}(x)\} &= E[\hat{f}(x)] - f(x) \\
\mathrm{Var}\{\hat{f}(x)\} &= E[\hat{f}(x)^2] - \left(E[\hat{f}(x)]\right)^2.
\end{aligned} \tag{14.6} $$


The probability density function of a child node in the causal model is affected
by its parent nodes, so its probability density is usually multidimensional. Consider a
two-dimensional kernel density estimate $\hat{f}(x, y)$ as an example. Its bias and
variance are

$$ \begin{aligned}
\mathrm{Bias}\{\hat{f}(x, y)\} &= E[\hat{f}(x, y)] - f(x, y) \\
\mathrm{Var}\{\hat{f}(x, y)\} &= E[\hat{f}(x, y)^2] - \left(E[\hat{f}(x, y)]\right)^2.
\end{aligned} \tag{14.7} $$


Here the mean integrated squared error (MISE) is introduced as the evaluation index
of KDE. The MISE index has a unique advantage in evaluating the difference between
the estimated function and the true function. At the same time, it also guarantees the
fitness and smoothness of the kernel estimate.

One-dimensional MISE is defined as

$$ \mathrm{MISE}[\hat{f}(x)] = E \int \left[\hat{f}(x) - f(x)\right]^2 dx. \tag{14.8} $$

Two-dimensional MISE is defined as

$$ \mathrm{MISE}[\hat{f}(x, y)] = E \iint \left[\hat{f}(x, y) - f(x, y)\right]^2 dx\, dy. \tag{14.9} $$


The above MISE indices can be simplified into the closed-form expressions (14.10) and
(14.11); the details can be found in the supporting information of Chen et al. (2018).


It is found from (14.10) and (14.11) that the values of the integrals $\int K^2(t)\,dt$
and $\int t^2 K(t)\,dt$ depend only on the kernel function $K$. They are not difficult to calculate
once the mathematical expression of the kernel function is substituted into the above
equations. Generally speaking, the window width $h$ has a greater impact on the MISE value,
so optimizing $h$ is critical. Here (14.10) and (14.11) are also used as optimization
objectives to find the best window width $h$.
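
The two kernel-dependent integrals are easy to check numerically. The following Python sketch evaluates them with a simple Riemann sum for the four kernels of Table 14.1, written here in their standard textbook forms; the dictionary names and the integration grid are hypothetical choices for this illustration.

```python
import numpy as np

# The four kernels of Table 14.1 in their standard forms.
kernels = {
    "uniform": lambda t: 0.5 * (np.abs(t) <= 1),
    "triangle": lambda t: (1 - np.abs(t)) * (np.abs(t) <= 1),
    "gaussian": lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi),
    "epanechnikov": lambda t: 0.75 * (1 - t ** 2) * (np.abs(t) <= 1),
}

t = np.linspace(-8.0, 8.0, 160001)  # integration grid, step 1e-4
dt = t[1] - t[0]

constants = {}
for name, K in kernels.items():
    k = K(t)
    constants[name] = {
        "mass": np.sum(k) * dt,              # integral of K(t), should be 1
        "int_K2": np.sum(k ** 2) * dt,       # integral of K(t)^2
        "int_t2K": np.sum(t ** 2 * k) * dt,  # integral of t^2 K(t)
    }

for name, c in constants.items():
    print(name, {key: round(val, 4) for key, val in c.items()})
```

The closed-form values are $\int K^2 = 1/2,\ 2/3,\ 1/(2\sqrt{\pi}),\ 3/5$ and $\int t^2 K = 1/3,\ 1/6,\ 1,\ 1/5$ for the uniform, triangle, Gaussian and Epanechnikov kernels, respectively.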

For the one-dimensional probability density, let $d(\mathrm{MISE}[\hat{f}(x)])/dh = 0$, which
gives the optimal window width in (14.12); the two-dimensional case is handled in the same
way and leads to (14.14). If the kernel function is predetermined, the kernel-dependent
factor $C(K)$ in these expressions is a constant. Usually
the true probability density functions $f(x)$ and $f(x, y)$ are unknown. The estimated
probability density functions (14.3) and (14.4) are therefore substituted into (14.12) and
(14.14), respectively. Then the optimal parameter $h$ for one-dimensional estimation, or $h_1$ and
$h_2$ for two-dimensional estimation, is obtained.
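
For orientation only, the asymptotic form of the one-dimensional MISE that is standard in the KDE literature (for example Silverman 1998) is recalled below. It is given as a reference point under the usual smoothness assumptions and may differ in notation or detail from the expressions derived in Chen et al. (2018).

```latex
\mathrm{MISE}[\hat{f}(x)] \approx \frac{1}{nh} \int K^2(t)\,dt
  + \frac{h^4}{4} \left( \int t^2 K(t)\,dt \right)^2 \int \left[ f''(x) \right]^2 dx ,
\qquad
h_{\mathrm{opt}} = \left[ \frac{\int K^2(t)\,dt}
  {n \left( \int t^2 K(t)\,dt \right)^2 \int \left[ f''(x) \right]^2 dx} \right]^{1/5} .
```

The first term is the integrated variance, which grows as $h$ shrinks, and the second is the integrated squared bias, which grows with $h$. Setting the derivative with respect to $h$ to zero balances the two and gives $h_{\mathrm{opt}}$, which decreases as $n^{-1/5}$. Since $f''$ is unknown, it has to be replaced by an estimate, in line with the substitution described above.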


14.2 Dynamic Threshold for the Fault Detection 


Generally speaking, the measurements of the process variables differ markedly
between normal operation and faulty operation, and this difference
is reflected in the probability density distribution. Fault detection
finds this difference by means of appropriate thresholds. Here, it is not feasible
to use the confidence interval of the normal state directly to distinguish the fault. The
actual process data are usually accompanied by considerable noise, so the distribution is not
ideal even in normal operation, and its confidence limit cannot be
described as a constant horizontal line. A constant confidence line therefore makes it
difficult to distinguish normal operation from faulty operation. For this reason, the idea of a
dynamic threshold is introduced. The fused lasso (FL) method is commonly used for denoising in
the field of signal processing. Here it is used to design the dynamic confidence limits.
It provides the required reasonable range for each node based on the normal data.

The Fused Lasso Signal Approximator (FLSA) aims at eliminating noise and
smoothing data (Bensi et al. 2013). The approximation $y_k \approx \beta_k x_k$ of the real-valued
observations $y_k$ is obtained by finding the sequence $\beta_1, \ldots, \beta_N$ that minimizes the criterion

$$ J_{FL} = \frac{1}{2} \sum_{k=1}^{N} (y_k - \beta_k x_k)^2 + \lambda_1 \sum_{k=1}^{N} |\beta_k|
   + \lambda_2 \sum_{k=2}^{N} |\beta_k - \beta_{k-1}|, \tag{14.15} $$

where $\lambda_1$ and $\lambda_2$ are tuning parameters and $x_1, \ldots, x_N$ are the feature variables. The
objective $J_{FL}$ consists of three parts. The first part, $\frac{1}{2} \sum_{k=1}^{N} (y_k - \beta_k x_k)^2$, is the traditional index
of the least squares algorithm. It strives for the regression accuracy of the model over all
the existing measurements $[x_k, y_k]$. The last two parts,
$\lambda_1 \sum_{k=1}^{N} |\beta_k| + \lambda_2 \sum_{k=2}^{N} |\beta_k - \beta_{k-1}|$, encourage the sparsity of the regression coefficients and of their differences. The
parameters $\lambda_1$ and $\lambda_2$ are adjusted to trade off the regression accuracy against the denoising
power. (14.15) is purely a denoising problem if $\lambda_1 = 0$.

Here the hidden Markov model (HMM) and the maximum likelihood estimation
method are used for the optimization. The HMM posits an emission
probability $\Pr(y_k | \beta_k)$ that is a standard normal distribution, and a transition
probability $\Pr(\beta_{k+1} | \beta_k)$ that is double exponential with parameter $\lambda_2$ (where $\Pr$ denotes
probability).

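The Viterbi algorithm is a typical dynamic programming algorithm for this HMM
problem; a detailed description can be found in (Rabiner et al. 1989). The objective
function (14.15) is rewritten as a maximization in a more general form,

$$ J_{FL} = \sum_{k=1}^{N} e_k(\beta_k) - \lambda_2 \sum_{k=2}^{N} d(\beta_k, \beta_{k-1}), \tag{14.16} $$

where $e_k(b)$ collects the terms of (14.15) that depend only on $\beta_k = b$ and $d(\cdot, \cdot)$ is the penalty
on successive differences; for (14.15), $e_k(b) = -\frac{1}{2}(y_k - b x_k)^2 - \lambda_1 |b|$ and
$d(\beta_k, \beta_{k-1}) = |\beta_k - \beta_{k-1}|$.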
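Denote a variable sequence $(x_1, x_2, \ldots, x_k)$ by the shorthand $x_{1:k}$. Rewrite the
criterion (14.16) as follows:

$$ \begin{aligned}
J_{FL} &= \max_{\beta_{1:N}} \left[ \sum_{k=1}^{N} e_k(\beta_k) - \lambda_2 \sum_{k=2}^{N} d(\beta_k, \beta_{k-1}) \right] \\
&= \max_{\beta_N} \left\{ e_N(\beta_N) + \max_{\beta_{1:(N-1)}} \left[ \sum_{k=1}^{N-1} e_k(\beta_k)
   - \lambda_2 \sum_{k=2}^{N} d(\beta_k, \beta_{k-1}) \right] \right\}
\end{aligned} \tag{14.17} $$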
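and

$$ \begin{aligned}
f_N(\beta_N) &= \max_{\beta_{1:(N-1)}} \left[ \sum_{k=1}^{N-1} e_k(\beta_k) - \lambda_2 \sum_{k=2}^{N} d(\beta_k, \beta_{k-1}) \right] \\
&= \max_{\beta_{N-1}} \left\{ e_{N-1}(\beta_{N-1}) - \lambda_2 d(\beta_N, \beta_{N-1})
   + \max_{\beta_{1:(N-2)}} \left[ \sum_{k=1}^{N-2} e_k(\beta_k) - \lambda_2 \sum_{k=2}^{N-1} d(\beta_k, \beta_{k-1}) \right] \right\}.
\end{aligned} \tag{14.18} $$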

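The definitions of the functions $f_{N-1}(\beta_{N-1}), f_{N-2}(\beta_{N-2}), \ldots, f_2(\beta_2)$ are similar
to that of $f_N(\beta_N)$. The maximization problem is then solved iteratively. It is summarized
by introducing the following intermediate functions, with $k$ ranging from 2 to $N$:

$$ \begin{aligned}
\delta_1(b) &:= e_1(b) \\
\psi_k(b) &:= \arg\max_{\tilde{b}} \left[ \delta_{k-1}(\tilde{b}) - \lambda_2 |b - \tilde{b}| \right] \\
f_k(b) &:= \delta_{k-1}(\psi_k(b)) - \lambda_2 |b - \psi_k(b)| \\
\delta_k(b) &:= e_k(b) + f_k(b).
\end{aligned} \tag{14.19} $$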


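The functions $\psi_k(\cdot)$ take part in the backward pass of the algorithm. This backward
pass computes $\hat{\beta}_1, \ldots, \hat{\beta}_N$ through a recursion identical to that of the Viterbi
algorithm for HMMs:

$$ \begin{aligned}
\hat{\beta}_N &= \arg\max_{b} \{\delta_N(b)\} \\
\hat{\beta}_k &= \psi_{k+1}(\hat{\beta}_{k+1}) \quad \text{for } k = N-1, N-2, \ldots, 1.
\end{aligned} \tag{14.20} $$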


The above FL theory is used to obtain the dynamic threshold of
the data model. During fault detection, the probability values estimated by KDE
are the input of the FLSA algorithm for smoothing. The influence of
data noise on the estimated probability density function is thereby eliminated, and a credible
threshold is found to distinguish normal operation from faulty operation.
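
The forward recursion (14.19) and the backward pass (14.20) can be illustrated on a discretized state space. The following Python sketch is only an illustration under stated assumptions: the candidate values $b$ are restricted to a uniform grid (an exact solver works with continuous functions of $b$), $x_k = 1$ is assumed so that $e_k(b) = -\frac{1}{2}(y_k - b)^2 - \lambda_1 |b|$, and the names `flsa_objective` and `flsa_viterbi_grid`, the grid size and the test signal are hypothetical.

```python
import numpy as np

def flsa_objective(beta, y, lam1, lam2):
    """Criterion (14.15) with x_k = 1 (assumption of this sketch)."""
    return (0.5 * np.sum((y - beta) ** 2) + lam1 * np.sum(np.abs(beta))
            + lam2 * np.sum(np.abs(np.diff(beta))))

def flsa_viterbi_grid(y, lam1, lam2, grid_size=201):
    """Grid-restricted Viterbi solver following (14.19) and (14.20)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    grid = np.linspace(min(y.min(), 0.0), max(y.max(), 0.0), grid_size)
    # penalty[i, j] = lam2 * |b_i - b_j| (rows: current value, columns: previous value)
    penalty = lam2 * np.abs(grid[:, None] - grid[None, :])

    def emission(k):
        return -0.5 * (y[k] - grid) ** 2 - lam1 * np.abs(grid)

    delta = emission(0)                        # delta_1(b) = e_1(b)
    psi = np.zeros((n, grid_size), dtype=int)  # psi[k][i]: best previous grid index
    for k in range(1, n):
        scores = delta[None, :] - penalty      # delta_{k-1}(b_prev) - lam2 |b - b_prev|
        psi[k] = np.argmax(scores, axis=1)
        f_k = scores[np.arange(grid_size), psi[k]]
        delta = emission(k) + f_k              # delta_k(b) = e_k(b) + f_k(b)

    idx = np.empty(n, dtype=int)
    idx[-1] = np.argmax(delta)                 # beta_N = argmax delta_N(b)
    for k in range(n - 1, 0, -1):
        idx[k - 1] = psi[k][idx[k]]            # backward pass (14.20)
    return grid[idx]

# Noisy two-level signal standing in for a sequence of KDE values (illustrative only).
rng = np.random.default_rng(4)
signal = np.concatenate([np.full(30, 1.0), np.full(30, 2.0)])
noisy = signal + 0.1 * rng.normal(size=60)
smooth = flsa_viterbi_grid(noisy, lam1=0.0, lam2=0.5)
print(np.round(smooth[[0, 29, 30, 59]], 2))
```

With $\lambda_2 = 0$ the sketch simply reproduces the observations up to the grid resolution, whereas a very large $\lambda_2$ forces a constant sequence. Intermediate values give the piecewise-smooth curve that serves as the denoised reference.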


14.3 Forward Fault Diagnosis and Reverse Reasoning 


The previous sections have provided the theoretical basis, including
the construction of the probabilistic graphical model, the selection of the evaluation
index for probability density estimation together with parameter optimization, and the setting of
dynamic thresholds for fault detection. The model structure is determined
by the causal directions between operating units, which represent the qualitative
relationships between nodes. The non-parametric KDE is used to obtain the
parameters of the graph model, i.e., the causal probability relationships. Probability
quantitatively describes the dependence between process variables. The evaluation
index of the probability relationship estimation is derived and calculated to
ensure the accuracy of the graphical model.

This section combines the above methods into
a fault detection and diagnosis framework, which can be used to diagnose
abnormal events in the system and locate the root cause of the fault. The overall
framework of the proposed method is shown in Fig. 14.1.

Fig. 14.1 The overall framework


The main steps for fault detection and root tracing are summarized as follows, based on the
detailed flowchart in Fig. 14.2:

1. Construct a cause-effect network structure for the selected process variables from
the industrial process;

2. List all the probability density functions that need to be estimated, including
one-dimensional densities for root nodes, multidimensional joint densities, or the
corresponding conditional probability densities for child nodes;

3. Estimate the (conditional) probability densities of each node based on the KDE
method;

4. Calculate the dynamic threshold for the health status of each node by inputting all the
density values to FLSA;

5. Collect test data and detect whether faults occur by comparison with the dynamic
threshold;

6. Perform reverse reasoning based on the graph model in the case of failure. Starting from
the faulty node, check in turn which parent nodes of the faulty node are faulty.
Remove all non-faulty parent nodes and clarify the fault propagation path until
the fault root is found.
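
The reverse reasoning of the last step can be sketched as a walk from the faulty node towards its ancestors. The following Python sketch is illustrative only: the parent lists are an assumed topology read off from the densities used in the case study of Sect. 14.4, the set of faulty nodes is a hypothetical detection outcome, and the function name `trace_fault_roots` is not part of the original method.

```python
# Assumed parent lists (hypothetical reading of the causal structure).
PARENTS = {
    "x1": ["x3"], "x3": ["x5"], "x6": ["x3"], "x4": ["x2"],
    "x5": ["x2", "x8"], "x7": ["x5"], "x2": [], "x8": [],
}

def trace_fault_roots(start, parents, is_faulty):
    """Follow faulty parents upwards from a faulty node.

    Returns the fault roots (faulty nodes without any faulty parent)
    and the propagation path that leads to each of them.
    """
    roots, paths = [], []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        # Non-faulty parents are removed; only faulty ones are followed.
        faulty_parents = [p for p in parents.get(node, []) if is_faulty(p)]
        if not faulty_parents:
            roots.append(node)
            paths.append(path)
        else:
            for p in faulty_parents:
                stack.append(path + [p])
    return roots, paths

# Hypothetical outcome of step 5: nodes whose density violates the threshold.
faulty_nodes = {"x7", "x5", "x8"}
roots, paths = trace_fault_roots("x7", PARENTS, lambda node: node in faulty_nodes)
print(roots, paths)
```

Because the causal structure is a directed acyclic graph, the walk always terminates, and each returned path is one candidate propagation route of the fault.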
Fig. 14.2 Flowchart for detecting and tracing faults


Fig. 14.3 The causal structure of the partial TE process

14.4 Case Study: Application to TEP 


The proposed methods are verified on the Tennessee Eastman (TE) process simulator.
The TE process contains a total of 52 process variables and measurement variables.
Eight variables in the reactor module are selected to test the causal structure, the same as in
Chap. 13. The physical meanings of these variables are listed in Table 14.2. With
the causal analysis method, the causal relationships
between the eight variables are obtained (the detailed analysis can be found in Chap. 13). The
corresponding topology is shown in Fig. 14.3.

Table 14.2 Process manipulated variables

Variable (symbol in Fig. 14.3)  Physical meaning                           Units
$x_1$ ($v_5$)                   Recycle flow                               km³/h
$x_2$ ($v_6$)                   Reactor feed rate                          km³/h
$x_3$ ($v_7$)                   Reactor pressure                           kPa
$x_4$ ($v_8$)                   Reactor level                              %
$x_5$ ($v_9$)                   Reactor temperature                        °C
$x_6$ ($v_{12}$)                Product separator level                    %
$x_7$ ($v_{20}$)                Compress work                              kW
$x_8$ ($v_{21}$)                Reactor cooling water outlet temperature   °C


List all the probability density functions and conditional probability densities of the
nodes in the causal graph. In total, $f(x_2)$, $f(x_8)$, $f(x_4|x_2)$, $f(x_5|x_8)$, $f(x_7|x_5)$,
$f(x_3|x_5)$, $f(x_1|x_3)$, $f(x_6|x_3)$ need to be estimated. Here the root nodes $x_2$ and $x_8$
have one-dimensional probability density functions. The window width $h$ is optimized to
obtain an accurate probability estimate.

The training data set contains 960 samples from normal operation. These data are
used to obtain the KDE of the model. Combined with the causal structure constructed in the
previous step, they give a complete graphical model. Fault IDV(4) is a minor fault, which
is used as a test case to verify the effectiveness and sensitivity of the proposed
method for minor faults. The fault IDV(4), a step change in the reactor cooling water
inlet temperature, is introduced in the middle of the reaction. Then 960 samples are
obtained as the testing data set, in which the first 480 samples are normal and the
following 480 are faulty.

In order to trace the root cause of the fault, a child node must be
selected to test the fault. One of the child nodes of the graphical model, $x_7$, is
randomly selected as the experimental object. According to the causal structure,
$x_7$ is directly related to $x_5$. Here $x_5$ is the parent node of $x_7$, so the
conditional probability density $f(x_7|x_5)$ is calculated first. Figure 14.4 gives the graphical
representation of the probability relationship between these two variables. Figure 14.4a
depicts the probability density of the normal data and the fault data as a function of sampling
time. Based on the fused lasso (FLSA) method, the obtained KDE estimate is treated as a
rough signal for denoising and restoration. The crossed line in Fig. 14.4b represents
the KDE recovered after denoising, which is set as the dynamic threshold. It can
be clearly seen that after about 480 samples, the conditional probability of $x_7$
exceeds the normal limit.

Fig. 14.4 Conditional probability of $x_7$ under $x_5$

Fault tracing refers to finding the root cause of the failure in $x_7$. The existing graph
model clearly shows the causal relationships between nodes, so the propagation
path of the fault can be easily analyzed. The reverse reasoning is carried out based on
the established causal structure and parameter model. Starting from the faulty variable,
the probability density functions of its parent nodes are calculated in turn. The probability
density curves obtained under normal and fault conditions are compared to determine
whether the variables on each path are faulty. This step is repeated until the root
cause of the failure is found. In order to infer the roots of the fault in $x_7$, it is necessary to
calculate $f(x_5|x_8)$, $f(x_5|x_2)$, $f(x_2)$ and $f(x_8)$ separately. Simulation results are shown
in Fig. 14.5.

From the detection results, the true propagation path of the fault can be
analyzed. The test shows that the root of the fault is $x_8$. According to the physical
meaning of the variable, the root cause is the temperature of the cooling water, and
fault IDV(4) is a step change in the temperature of the cooling water. The result is
consistent with the actual process.


14.5 Conclusions


This chapter proposes a probabilistic graphical model built directly for continuous
process variables, aiming at fault detection and root tracing. The model structure is
determined by the causal relationships, and the probability relationships in the model
are determined by the KDE method. For the child nodes in the causal structure, i.e.,
variables affected by other nodes, the conditional probability density functions are
calculated from the multidimensional joint probability density and the
low-dimensional probability density. They reflect the strength of the causal
connections between the variables. An MISE index is rigorously derived to evaluate
the estimation accuracy of KDE and to optimize the KDE parameters. A dynamic
threshold is constructed based on the FLSA algorithm to check the change in
probability density and thereby detect the fault. The experimental results on the TE process
show that the proposed method not only accurately detects the occurrence of the
failure, but also succeeds in finding its root cause.